fuzzix dot org

How Easy are Perl's Pluggable Keywords?

6 Nov 2023

Introduction

Perl 5.12 introduced the pluggable keyword API, which allows modules to define custom "keyword-headed" expressions (that is, your new custom keyword must be concerned with what comes after it, not before).

The idea of modifying Perl to introduce new keywords and operators goes back further, to the introduction of source filters in Perl 5.6 (almost a quarter century ago). The use of source filters is discouraged these days, mainly because "only perl can parse Perl" - that is, the language has a single implementation and some difficult-to-specify dynamic behaviours, and there is no separate static analyser or compiler. (There are other Perl implementations out there, but they may feature compromises which mean they cannot work with arbitrary Perl source, or they may no longer be maintained.)

Source filters are akin to C preprocessor macros, in that they sit between the source code and the parser - the parser receives fully modified source. Source filters frequently operate with regex match and replace operations, which may be error prone when applied to a complete source file. They may also have side effects depending on how they are implemented, such as clobbering your __DATA__ section.

Another approach, now deprecated, was Devel::Declare, which hooked into the parser to inject keywords and other functionality. Do not use this - it is naught but an interesting historical artifact.

The pluggable keyword API enables keyword creation at compile time, and can take full advantage of the parser's lifecycle and language recognition features.

This post documents my own fumblings with pluggable keywords and works through an example keyword-based feature to explore the ins-and-outs of pluggable keywords (tl;dr: do not walk my path - I get close, but fail to attain cigar).

An Inaccurate Overview of the Keyword API

We should begin with a swift overview of the API itself (swift so I can wave my hands a lot, and paper over the parts I don't understand with vague terms).

The entry point to the API is a function pointer called PL_keyword_plugin. When your custom keyword plugin loads, it should set this pointer to run your code when a potential keyword is encountered. Any existing value of the pointer should be stashed so you can call that extant function if you're not interested in the keyword. That is, you set up a chain of execution where each plugin calls the one loaded before until a keyword is recognised.

The function prototype is (mostly) (char *keyword_ptr, STRLEN keyword_len, OP **op_ptr). The first parameter is a pointer to the source code text at the point the keyword was encountered. The second parameter is the length of the keyword. The third is a pointer to the optree - a data structure representing the source code being parsed, analogous to an abstract syntax tree.

Your function should "consume" as much of the source as is appropriate, and update the optree with the data and functionality required to do the keyword's work.

The pluggable keyword API as designed requires use of XS, a language which acts as an interface between Perl and C. As I am not familiar enough with XS to develop a new feature with it, let's for now discount it as an easy option (remember the title of this post - we're chasing easy for now).

Doing it in Perl

Keyword::Simple offers a wrapper for the keyword API. It provides a define function which takes two parameters: the name of your keyword, and a callback which receives a reference to the source code from just after the point your keyword was used (that is, the keyword itself is already consumed). Your callback should modify this reference to inject new functionality into the source.
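As a minimal sketch (closely modelled on Keyword::Simple's own synopsis, with a made-up module name), here is a keyword provided which is nothing more than a synonym for if:

package Keyword::Provided;

use strict;
use warnings;
use Keyword::Simple;

sub import {
    # 'provided' becomes a keyword; the callback receives a reference to the
    # source text immediately after the keyword and may modify it in place.
    Keyword::Simple::define 'provided', sub {
        my ( $ref ) = @_;
        substr( $$ref, 0, 0 ) = 'if';    # inject 'if' where 'provided' used to be
    };
}

sub unimport {
    Keyword::Simple::undefine 'provided';
}

1;

A caller can then write provided ( $answer == 42 ) { say 'So long!' } anywhere an if statement would be legal.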

Immediately I think "Wait, isn't this just a source filter?".

Wait, isn't this just a source filter?

Not quite. A source filter acts on the entire source file and often has to guess at the correct context. Here the parser has recognised a keyword and passed that to the callback, so we can be fairly certain the context is appropriate.

This allows for simpler parsing in our callback, as we don't have to account for other syntax features and can focus just on what's expected in the new syntax - that is, we don't have to check if we are making changes in the middle of a heredoc or a comment or otherwise outside the desired context.

As with source filters, we do have new source code injected for the parser - expect possible side effects, such as line numbers in error messages no longer lining up with the source file (though some authors take steps to preserve line numbers from the original source).
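One such step - sketched here under the assumption that the handler has tracked the original file name and the line at which the remaining source resumes - is to append a #line directive to whatever text is injected, so perl resets its idea of the current line afterwards:

# Hypothetical sketch: $file and $line are assumed to have been captured by
# the keyword handler; perl honours #line directives in source text.
substr( $$ref, 0, 0 ) = $injected_code . qq{\n#line $line "$file"\n};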

Also, as we'll see shortly, there are tools to help control the complexities of parsing your keyword's parameters.

OK, I guess

We can take a look at reverse dependencies of Keyword::Simple for some clues on how best to exploit its powers - taking one at random, PerlX::ScopeFunction.

Looking at the source, the major thing to note is that it makes use of PPR, Damian Conway's Pattern-based Perl Recognizer. While I noted before that "only perl can parse Perl", PPR can match and decompose constructs in Perl source remarkably effectively. I'll link to a talk below where Damian outlines some of the advanced feats PPR pulls off, as well as some of the arcane magick involved in its construction.

Keyword::Declare builds upon the simple approach offered by Keyword::Simple with some PPR-powered niceties. New keywords are defined using keyword, which takes the name of your new keyword, a list of named PPR entities, and a block which should return your replacement source - the named entities are consumed from the source. That is, your returned string replaces the matched entities you listed in the declaration.
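To get a feel for the shape before the fuller example below, a minimal hypothetical keyword might look something like this (repeat and its parameters are invented for illustration):

use Keyword::Declare;

# 'repeat' takes a number and a block; both are consumed from the source,
# and the returned string replaces them with a plain for loop.
keyword repeat (Num $count, Block $block) {
    "for (1..$count) $block";
}

# Later, in the same lexical scope:
# repeat 3 { say 'hello' }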

Example - Multiple Dispatch

This example will use Keyword::Declare to (partially) implement multiple dispatch, or multimethods. Multimethods allow for the definition of several variants of a method with the same identifier. Which variant to execute is selected at run-time, based on the data types of the parameters. Don't think of traditional types, which offer compiler cues for things like storage requirements, compile-time validation of assignment, or casting. This is a run-time process which inspects the content of parameters to see what they look like. There is no casting or autoboxing or build-time validation.

Use of the new keyword multi should look something like:

class Multiplier {

    multi method multiply( Num $multiplier, Num $multiplicand ) {
        $multiplier * $multiplicand;
    }

    multi method multiply( Str $string, Int $multiplicand ) {
        $string x $multiplicand;
    }

    multi method multiply( OBJECT $obj, Int $multiplicand ) {
        map { $obj->clone } 1..$multiplicand;
    }

    ...
}

Here we have three different definitions for the multiply method, one of which may be executed at runtime depending on whether the first parameter is a number, string, or object instance. The second parameter should also be validated to be a number of the appropriate form.

Keyword definition

Since we have a fair idea of how the syntax will look, let's take a peek at a complete keyword definition:

keyword multi (
    /sub|method/ $sub,
    Ident $method,
    Attributes? $attribs,
    List? $raw_params,
    Block $code
) {
    my $signature      = _extract_signature( $raw_params );
    my $param_string   = join ',', keys $signature->%*;
    my @types          = values $signature->%*;
    my $signature_name = join '_', map { $_ // 'undef' } @types;
    my $target_method  = "_multimethod_${signature_name}_$method";

    die "Ambiguous signature in $method declaration"
        if $class->can( $target_method );

    $methods->{ $class }->{ $method } //= [];
    push $methods->{ $class }->{ $method }->@*, {
        signature => $signature, method => $target_method
    };

    _build_type_checkers( @types );
    _inject_proxy_method( $class, $method );

    "$sub $target_method $attribs ( $param_string ) $code";
}

This keyword should expect to be placed before a sub or method declaration, consisting of an identifier, optional attributes, an optional list of expected parameters, then a block of code (it occurs to me now reading this that we could likely leave the code block out of the keyword definition - it is not changed at all).

As our param list also contains new and unique syntax, we'll need to unroll that ourselves in _extract_signature:

sub _extract_signature( $raw_params ) {
    $raw_params =~ s/^\(//;
    $raw_params =~ s/\)$//;
    +{
        map {
            my @param = split " ", $_;
            $param[1]
                ? ( $param[1] => $param[0] )
                : ( $param[0] => undef )
        } split ",", $raw_params
    }
}

This removes the parentheses from the parameter list text, then builds a hash of parameter => type. For example, the method declaration:

multi method hello( Int $foo, $bar );

...would result in the hash:

{
    $foo => 'Int',
    $bar => undef
}

Returning to the keyword definition, the next lines look like so:

    my $param_string   = join ',', keys $signature->%*;
    my @types          = values $signature->%*;
    my $signature_name = join '_', map { $_ // 'undef' } @types;
    my $target_method  = "_multimethod_${signature_name}_$method";

The $param_string variable will become useful when returning text from the keyword definition block. The array @types contains the types extracted from the signature hash (keys and values return corresponding orderings for the same hash, though not necessarily the order of declaration). This is used to build a unique method name, $target_method - for instance, the Num/Num variant of multiply above ends up as _multimethod_Num_Num_multiply - which acts as a means to hack our multimethod definitions into the package/class namespace, where method names must be unique.

Let's return to the keyword definition again - the next lines are:

    die "Ambiguous signature in $method declaration"
        if $class->can( $target_method );

    $methods->{ $class }->{ $method } //= [];
    push $methods->{ $class }->{ $method }->@*, {
        signature => $signature, method => $target_method
    };

Firstly a quick validation is performed - has a multi method with this signature already been declared?

The $methods hashref is a file-scoped stash of info about each declared variant - the method's signature hash and $target_method name.
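For the Multiplier example near the top, the stash would end up shaped roughly like this (illustrative only - one entry is pushed per multi declaration):

$methods = {
    Multiplier => {
        multiply => [
            {
                signature => { '$multiplier' => 'Num', '$multiplicand' => 'Num' },
                method    => '_multimethod_Num_Num_multiply',
            },
            # ... one hashref per remaining 'multi method multiply' declaration ...
        ],
    },
};

Next: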

  _build_type_checkers( @types );
  _inject_proxy_method( $class, $method );

We start with _build_type_checkers ...

sub _build_type_checkers( @types ) {
    for my $type ( uniq sort grep { $_ } @types ) {
        next if $checkers->{ $type };
        $checkers->{ $type } = sub( $datum ) {
            state $ts_type = Types::Standard->can( $type );
            $ts_type
                ? $ts_type->()->assert_valid( $datum )
                : blessed $datum && $datum->isa( $type )
                    or die("$datum is not an instance of $type");
        };
    }
}

This function maintains another file-scoped stash, $checkers, which contains validation coderefs to check passed data against the named type. If the type is supported by Types::Standard, its assert_valid method is used; otherwise the type is treated as an isa check - is this an instance of the specified class?
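Once built, a checker behaves roughly like this (the Widget class name is invented for illustration):

$checkers->{ Int }->( 123 );        # passes - Types::Standard's Int accepts it
$checkers->{ Int }->( 'abc' );      # throws a Types::Standard validation error
$checkers->{ Widget }->( $thing );  # no such type in Types::Standard, so falls
                                    # back to: blessed $thing && $thing->isa('Widget')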

Next we _inject_proxy_method ...

sub _inject_proxy_method( $class, $method ) {
    return if $class->can( $method );

    my $meta = Object::Pad::MOP::Class->for_class( $class );
    $meta->add_method(
        "$method",
        sub {
            Multimethod::delegate( $class, $method, @_ )
        }
    );
}

Here we make use of the experimental Object::Pad MOP (meta-object protocol). A MOP is an API which allows for inspection and modification of a class' features - class members, methods, roles, inheritance tree, and so on. The API opens up class internals and allows definition of new roles and classes dynamically.

Note the explicit stringification of $method - this is an instance of Keyword::Declare::Arg.

The proxy method we inject into the class here takes over the multimethod's name and delegates both the selection and the calling of the appropriate variant to Multimethod::delegate, which looks like this:

sub delegate( $class, $method, $instance, @params ) {
    my $delegates = $methods->{ $class }->{ $method };
    my $delegate_method = _find_signature_match( $delegates, @params );
    die "No delegate method found for ${class}::$method" unless $delegate_method;
    $instance->$delegate_method( @params );
}

This starts by pulling the set of methods defined for the method name from the $methods stash (the potential delegates (noun) to delegate (verb) to - don't blame me, I only speak this language), then calls _find_signature_match to find the first matching method. If a matching method is found, it is called, otherwise the program bails out. The _find_signature_match function is as follows:

sub _find_signature_match( $delegates, @params ) {
    OUTER: for my $delegate ( $delegates->@* ) {
        my @types = values $delegate->{ signature }->%*;

        my $iter = each_array( @types, @params );
        while ( my ( $type, $param ) = $iter->() ) {
            next unless $type;
            try {
                $checkers->{ $type }->( $param );
            } catch( $e ) {
                next OUTER;
            }
        }

        return $delegate->{ method };
    }
}

This simply iterates over each delegate method and returns the first one for which the passed parameters pass all type checks stashed for that declaration, or nothing at all.

Looking back to the keyword definition, the final job is to return the now-uniquely-named method definition:

"$sub $target_method $attribs ( $param_string ) $code";

The code outlined here is run each time the keyword multi is encountered.

Trying it Out

Let's kick off a REPL:

$ reply
0> use Object::Pad
1> use Multimethod
Variable "$class" will not stay shared at lib/Multimethod.pm line 112.

An inauspicious start. Multimethod's import function looks as follows:

sub import {
    my $class = caller();

    keyword multi ( ...

The $class variable is used within keyword's block. As this block is not an anonymous sub, it does not create a proper closure over $class. Since the block may be called at any time, perl warns us of a potential pitfall - it may end up working with a stale copy of $class. As we can be fairly certain the block runs now, this warning should be safe to ignore.
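The same warning can be reproduced with a trivial example that has nothing to do with keywords - a named sub nested inside another sub only ever sees the first instance of the outer lexical:

use strict;
use warnings;
use feature 'say';

sub outer {
    my $x = shift;
    sub inner { say $x }    # warns: Variable "$x" will not stay shared
    inner();
}

outer( 'first' );     # prints 'first'
outer( 'second' );    # still prints 'first' - inner() kept the original $x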

Also, it makes no sense to use Multimethod at this level - it needs a package name to work with in caller(). Let's start again:

$ reply
0> use v5.38.0
1> use Object::Pad
2> class Foo {
2... use Multimethod;
2... multi method say_things( Int $int ) { say "Got an int : $int" };
2... multi method say_things( Str $str ) { say "Got a string : $str" };
2... multi method say_things( $thing )   { say "Got something else : $thing" };
2... }
Variable "$class" will not stay shared at lib/Multimethod.pm line 112.
$res[0] = 1

Nothing catastrophic so far. Here we see three variants of say_things declared. It should hopefully be obvious what each of these methods does and what data it responds to.

Let's instantiate the class and see how the namespace looks:

3> my $foo = Foo->new
Object::Pad::MOP is experimental and may be changed or removed without notice at /home/fuzzix/perl5/perlbrew/perls/perl-5.38.0/lib/site_perl/5.38.0/Data/Printer/Filter/GenericClass.pm line 96.
$res[1] = Foo  {
    parents: Object::Pad::UNIVERSAL
    public methods (5):
        DOES, META, new, say_things
        Object::Pad::UNIVERSAL:
            BUILDARGS
    private methods (3): _multimethod_Int_say_things, _multimethod_Str_say_things, _multimethod_undef_say_things
    internals: []
}

We can see the new proxy method say_things, and the uniquely named methods for each variant of say_things declared above. A thought occurs - this system calls nominally "private" methods from another package. This is a definite wart.

The warning is from Data::Printer's inspection of object internals using the experimental Object::Pad::MOP.

Moving on, let's see if the multimethods work as expected:

4> $foo->say_things( 123 )
Got an int : 123
$res[2] = 1
5> $foo->say_things( 'abc' )
Got a string : abc
$res[3] = 1
6> $foo->say_things( {} )
Got something else : HASH(0x2f6a130)

$res[4] = 1

This looks good to me! (Merge it!)

Writing class definitions in a REPL isn't much fun, so I will write some basic tests and include them with the multimethod example source code.

Conclusion

While we ended up with a more-or-less functional multimethod implementation, it is not without problems (besides being woefully incomplete and lacking real validation, tests, etc.)

I think one major issue is the namespace pollution - we end up adding a chunk of mildly inscrutable private methods for each set of multimethod declarations. They may not have seemed so bad in the example above, but in even a modestly complex class, these methods may start to pile up.

There's also the fact that these nominally private methods are called from outside the class. Another approach might be to add a private delegate method to the class, but this clutters the namespace further. There's also AUTOLOAD - we could create a package named multi with an AUTOLOAD block which extracts a method name. That is, a proxy method named foo would call $self->multi::foo( @args ), rather than Multimethod::delegate. The package multi could then dispatch to the correct variant of foo based on the content of @args. See curry for an example of this type of thing. TMTOWTDI.
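A rough sketch of that idea follows - the multi package name comes straight from the paragraph above, while the actual variant selection (elided) would mirror Multimethod::delegate and _find_signature_match:

package multi;

use strict;
use warnings;

our $AUTOLOAD;

sub AUTOLOAD {
    my ( $instance, @args ) = @_;

    # $AUTOLOAD holds the fully qualified name, e.g. 'multi::say_things'
    my ( $method ) = $AUTOLOAD =~ /([^:]+)\z/;
    return if $method eq 'DESTROY';    # never treat object destruction as a dispatch

    # ... find the variant of $method whose signature matches @args,
    #     then: return $instance->$variant( @args );
    die 'No variant of ' . ref( $instance ) . "::$method matches the given arguments";
}

1;

The injected proxy for say_things would then call $self->multi::say_things( @args ) rather than Multimethod::delegate( ... ).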

The generic isa object check for types not defined in Types::Standard is not going to work in 99.9% of cases. A typical package name (using :: or ' package separators) really seems to confuse the signature parser. My expectation was that the signature would be parsed after Multimethod rewrites it with a valid signature, but something else happens here. I don't currently have more insight.

It should be obvious that the example presented here is far from production ready ... but also, that one probably should not write this type of functionality themselves. While Types::Standard is fantastic, there is an ongoing effort dubbed 'Oshun', which aims to bring data checks into core. This work could form the basis of multimethod support in Corinna, the specification and exploration project behind core feature 'class'. Object::Pad is an exploratory implementation of feature 'class' which I used here for the MOP.

I think the last major issue I have is rewriting source code. While this is far more tightly controlled than plain source filters, it feels error-prone - displaced error messages, a weirded call-stack ... I need to look into the XS optree manipulation approach. XS::Parse::Keyword appears to offer some niceties to perhaps help me along the path. This distribution also includes XS::Parse::Infix to help support a relatively new infix operator API (oh yeah, this is the juice!). There may be a follow-up to this post.

As for the opening question, how easy was all this? Quite easy! I don't know one end of a compiler from the other, what I know about type systems could be transcribed to a postage stamp in crayon, and yet with a little dynamism and a lot of sticky tape I built something that worked, after a fashion. I think the accessibility of Keyword::Simple and Keyword::Declare is a boon to the ecosystem. I can see myself making use of them again in future, albeit with some more care and attention based on what was observed here.

I learned something - I hope you did too.

Despite attempts to pre-emptively absolve myself of any responsibility for accuracy with talk of vagueness, I do not wish to spread disinformation. If something I've said here is completely wrong, feel free to reach out, or hit me across the nose with a rolled up newspaper and say "BAD!"

Links