Sunday 22 April 2007

Erlang Macro Processor (v2), Part V

The final step for EMP2 is to expand any remote macro function calls and insert the results back into the AST.

Naively we would just follow the same pattern as the macro attribute expansion that we have just added:

node_parse(Node={call,Line,{remote,_,{atom,_,Mod},{atom,_,Fun}},Args}, Mods) ->
    case lists:member(Mod, Mods) of
        true ->
            ast_from_results(lists:flatten([apply(Mod,Fun,Args)|" "]), Line, []);
        false -> setelement(4,Node,node_parse(Args, Mods))
        end;
But if we do, we find that there are three(!) problems with this approach.

Firstly, ast_from_results is currently using erl_parse:parse_form to turn the textual macro results into an AST. This only works for complete Erlang forms (function definitions) and not for, say, a set of three Erlang expressions to be inserted into a function. We can fix this by using erl_parse:parse_exprs instead, but we will also have to append a full-stop and space to the result string (instead of just a space) to get it to work properly.
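
As a quick illustration of the difference (a hedged sketch, not part of EMP2 itself), parse_exprs accepts a full-stop-terminated expression sequence that parse_form rejects because it is not a complete form:

%% Sketch only: feeding the same token list to both parse functions.
{ok, Tokens, _EndLine} = erl_scan:string("A = 1, A + 2. "),
{ok, _Exprs} = erl_parse:parse_exprs(Tokens),    % succeeds: two expressions
{error, _Reason} = erl_parse:parse_form(Tokens). % fails: not a complete form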

Secondly, the arguments for the function call are all in AST format with tuples and line numbers everywhere. We cannot just apply the function directly to these arguments; we need to convert them back to something more usable.

Finally, we may receive more than one Erlang expression from the macro. To fit these back into the space of a single node we have to wrap them in a block expression.


To tackle the first issue we need to update ast_from_results a little:
ast_from_results(FunParse, ResultsString, LineStart, ASTResults) ->
    case remove_leading_whitespace(ResultsString) of
        "" -> lists:flatten(lists:reverse(ASTResults));
        String ->
            {done,{ok,Tokens,LineEnd},StringRest} =
                erl_scan:tokens([], String, LineStart),
            {ok, AST} = erl_parse:FunParse(Tokens),
            ast_from_results(FunParse, StringRest, LineEnd, [AST|ASTResults])
        end.

As an aside, you might like to have a closer look at that erl_parse:FunParse call.

Yes, instead of hard-coding a function call or adding an extra if statement, we are calling the erl_parse function via a variable whose value we will not know until run-time[1]. Doesn't thinking about that just make you go all tingly inside? No? Me neither. Of course.

We can now use ast_from_results for erl_parse:parse_form and erl_parse:parse_exprs situations with only a single additional "erl_parse function" argument.


For the second issue I am going to use a cheap and nasty hack. Because we are not (yet) supporting anything fancier than literal terms in the argument list, we can get away with this little bit of trickery to convert the arguments into something usable by our call to apply:
ArgsLiteral = [Value || {_Type,_Line,Value} <- Args]. 
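
To make the hack concrete, here is what it does to the argument list of the example call used later, example_macro:lookup_binary(4) (the line number 17 below is made up purely for illustration):

%% Illustration only: a literal argument arrives wrapped in an AST tuple.
Args = [{integer,17,4}],
ArgsLiteral = [Value || {_Type,_Line,Value} <- Args].
%% ArgsLiteral is now [4], which is what apply/3 expects.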


The third issue is also very easily fixed by wrapping the call to ast_from_results in a block expression tuple. We should only do this if there is more than one node in the results list:
node_parse(Node={call,Line,{remote,_,{atom,_,Mod},{atom,_,Fun}},Args}, Mods) ->
    case lists:member(Mod, Mods) of
        true ->
            ArgsLiteral = [Value || {_Type,_Line,Value} <- Args],
            Results = ast_from_results(parse_exprs,
                lists:flatten([apply(Mod,Fun,ArgsLiteral)|". "]), Line, []),
            case length(Results) of
                1 -> hd(Results);
                _ -> {block,Line,Results}
                end;
        false -> setelement(4,Node,node_parse(Args, Mods))
        end;

Oh, and of course we need to update the other node_parse function clause to include the new argument to ast_from_results:
node_parse({attribute,Line,macro,{Mod,Fun,Args}}, _Mods) ->
    ast_from_results(parse_form, lists:flatten([apply(Mod,Fun,Args)|" "]), Line, []);


And with any luck we are done. Let's try it out on our example code.

1> CL = fun(F) -> c(F), l(F) end.
#Fun
2> CL(emp2), CL(example_macro), CL(example).
{module, example}
3> [example:lookup(N) || N <- lists:seq(0, 3)].
[0,2,4,6]
4>

Yep. EMP2 is done.


The full listing:

-module(emp2).
-author("Philip Robinson").
-vsn('1.0').
-export([parse_transform/2]).

parse_transform(AST, _Options) ->
    Mods = lists:flatten([Mods || {attribute,_Line,macro_modules,Mods} <- AST]),
    lists:flatten([node_parse(Node, Mods) || Node <- AST]).

node_parse({attribute,Line,macro,{Mod,Fun,Args}}, _Mods) ->
    ast_from_results(parse_form, lists:flatten([apply(Mod,Fun,Args)|" "]), Line, []);
node_parse(Node={call,Line,{remote,_,{atom,_,Mod},{atom,_,Fun}},Args}, Mods) ->
    case lists:member(Mod, Mods) of
        true ->
            ArgsLiteral = args_from_ast(Args),
            Results = ast_from_results(parse_exprs,
                lists:flatten([apply(Mod,Fun,ArgsLiteral)|". "]), Line, []),
            case length(Results) of
                1 -> hd(Results);
                _ -> {block,Line,Results}
                end;
        false -> setelement(4,Node,node_parse(Args, Mods))
        end;
node_parse(Node, Mods) when is_list(Node) ->
    [node_parse(Element, Mods) || Element <- Node];
node_parse(Node, Mods) when is_tuple(Node) ->
    list_to_tuple([node_parse(Element, Mods) || Element <- tuple_to_list(Node)]);
node_parse(Node, _Mods) -> Node.

args_from_ast(AST) -> [Value || {_Type,_Line,Value} <- AST].

ast_from_results(FunParse, ResultsString, LineStart, ASTResults) ->
    case remove_leading_whitespace(ResultsString) of
        "" -> lists:flatten(lists:reverse(ASTResults));
        String ->
            {done,{ok,Tokens,LineEnd},StringRest} =
                erl_scan:tokens([], String, LineStart),
            {ok, AST} = erl_parse:FunParse(Tokens),
            ast_from_results(FunParse, StringRest, LineEnd, [AST|ASTResults])
        end.

remove_leading_whitespace([9 |String]) -> remove_leading_whitespace(String);
remove_leading_whitespace([10|String]) -> remove_leading_whitespace(String);
remove_leading_whitespace([32|String]) -> remove_leading_whitespace(String);
remove_leading_whitespace( String ) -> String.

EMP2: Entirely painful compile-time macros for functions and expressions, in 45 lines of obscure, uncommented, and unreadable Erlang code.
(If you think that this code is bad, wait until you see EMP3.)


[1] Run-time for EMP2 is, of course, compile-time for the module that we are using EMP2 to transform.

Erlang Macro Processor (v2), Part IV

Okay, now we are getting somewhere. Time to expand some macros!

We will start with something easy: duplicating EMP1's functionality. We already have code from EMP1 to expand the -macro attribute entries, but unfortunately we cannot just cut-and-paste it into EMP2; our AST-walking is slightly different, so ast_reversed_results needs to be adjusted (and renamed to ast_from_results):


ast_from_results(ResultsString, LineStart, ASTResults) ->
    case remove_leading_whitespace(ResultsString) of
        "" -> lists:reverse(ASTResults);
        String ->
            {done,{ok,Tokens,LineEnd},StringRest} =
                erl_scan:tokens([], String, LineStart),
            {ok, AST} = erl_parse:parse_form(Tokens),
            ast_from_results(StringRest, LineEnd, [AST|ASTResults])
        end.

We change the -macro clause for node_parse to call the new function:

node_parse({attribute,Line,macro,{Mod,Fun,Args}}, _Mods) ->
    ast_from_results(lists:flatten([apply(Mod,Fun,Args)|" "]), Line, []);

And that obscene remove_leading_whitespace function has returned:

remove_leading_whitespace([9 |String]) -> remove_leading_whitespace(String);
remove_leading_whitespace([10|String]) -> remove_leading_whitespace(String);
remove_leading_whitespace([32|String]) -> remove_leading_whitespace(String);
remove_leading_whitespace( String ) -> String.

The only difference between ast_from_results and ast_reversed_results is that ast_from_results keeps the resulting AST in the same order as the input ResultsString argument (it kindly reverses its already-reversed results for us before passing them back).

Unlike EMP1, EMP2 does NOT want to receive the results of the expanded AST in reversed order. We are not following the "build a list in reverse and then reverse the result" model for our AST (which works just fine for traversing the top level only), but rather using a recursive descent model for AST parsing. In this situation we need to keep the results in the order that they appear.


Now we have the EMP2 module reproducing the functionality of EMP1, and at only a few more lines of code. The only thing left to do is identify macro function calls, apply them, and insert the parsed results into the AST in place of the original call.

Ha!


For remote function calls we have two situations to handle:
  • The remote function call is to a macro, and
  • The remote function call is not to a macro.

The easier case is when the remote function call is not to a macro function. We pretty much just want the default tuple node function to run on the node, but we cannot (easily) get there because this more-specific function clause will have intercepted the node before the default code gets a chance to run on it.

We could encapsulate the common default code in another function (or a substitution macro), but for simplicity's sake I will just build the required node in place with the setelement function. It is not a large amount of code:

node_parse(Node={call,Line,{remote,_,{atom,_,Mod},{atom,_,Fun}},Args}, Mods) ->
    case lists:member(Mod, Mods) of
        true ->
            io:format("Function-level macro call: ~w~n", [Node]),
            Node;
        false -> setelement(4,Node,node_parse(Args, Mods))
        end;


Next up: The final installment - expanding remote macro function calls.

Erlang Macro Processor (v2), Part III


The top level of the AST is a list of nodes, rather than a node in its own right, so we might write our first attempt at an[other] AST walker like this:


parse_transform(AST, _Options) ->
    Mods = lists:flatten([Mods || {attribute,_Line,macro_modules,Mods} <- AST]),
    lists:flatten([node_parse(Node, Mods) || Node <- AST]).

node_parse(Node, _Mods) -> Node.

The parse_transform function calls node_parse on each top-level node in the AST. It calls lists:flatten on the result because - as we already know - the EMP1-variety of top-level macro expansion may return more than one function definition from a single macro call. These definitions all need to be at the same "height" as the others, so the resulting deep list of nodes needs to be flattened.

These two functions together will traverse the top level of the AST but not examine any sub-nodes. To do that we need to split the atom... er, node tuples, and parse each element in sequence:

node_parse(Node, Mods) when is_tuple(Node) ->
    list_to_tuple([node_parse(Element, Mods) || Element <- tuple_to_list(Node)]).

Now if we were to compile and run this on our example.erl file we would get a big fat error... it turns out that not every element in a node tuple is actually another node tuple (but we already knew that, too). Some of the elements are lists, and some of them are atoms or integers. A few extra clauses should take care of these conditions:

node_parse(Node, Mods) when is_list(Node) ->
    [node_parse(Element, Mods) || Element <- Node];
node_parse(Node, _Mods) -> Node.


Here is the whole module in one piece:

-module(emp2).
-export([parse_transform/2]).

parse_transform(AST, _Options) ->
    Mods = lists:flatten([Mods || {attribute,_Line,macro_modules,Mods} <- AST]),
    lists:flatten([node_parse(Node, Mods) || Node <- AST]).

node_parse(Node, Mods) when is_list(Node) ->
    [node_parse(Element, Mods) || Element <- Node];
node_parse(Node, Mods) when is_tuple(Node) ->
    [Type,Line|ListElements] = tuple_to_list(Node),
    Results = [node_parse(Element, Mods) || Element <- ListElements],
    list_to_tuple([Type,Line|Results]);
node_parse(Node, _Mods) -> Node.

And that is all we need to generically walk the entire AST.

Trapping the specific nodes we want to macro-expand is also rather trivial. We need to catch macro module attributes and remote function calls, and to do that we just add two new clauses to the node_parse function:

node_parse(Node={attribute,Line,macro,{Mod,Fun,Args}}, _Mods) ->
    io:format("Line ~B: EMP1-style macro attribute found.~n", [Line]),
    % Do macro-expansion of attribute's Mod, Fun, and Args values.
    Node;
node_parse(Node={call,Line,{remote,_,{atom,_,Mod},{atom,_,Fun}},Args}, Mods) ->
    io:format("Line ~B: EMP2-style remote function call found.~n", [Line]),
    % Test whether the remote call is to a macro module.
    % If so, expand it. Otherwise traverse node as usual.
    setelement(4, Node, node_parse(Args, Mods));


Next up: Expanding the macros.

Saturday 21 April 2007

Erlang Macro Processor (v2), Part II

You know we need to do it eventually, so let's get the boring "find -macro_modules attributes and store their values" bit out of the way so we can move on to some more interesting stuff. Here we go:


-module(emp2).
-export([parse_transform/2]).

parse_transform(AST, _Options) ->
    Mods = lists:flatten([Mods || {attribute,_Line,macro_modules,Mods} <- AST]),
    io:format("Macro Modules: ~p~n", [Mods]),
    AST.

Ah, whoops. One of those list comprehension thingies seems to have slipped into the parse_transform function. To get rid of it we just have to change that line into something like this:

    Mods = lists:flatten(lists:map(
        fun ({attribute,_Line,macro_modules,Mods}) -> Mods;
            (_Node) -> []
            end,
        AST)),

Hmmm. On second thoughts, maybe we should keep the list comprehension.

I believe that list comprehensions are a relatively new feature in Erlang so you may not see too many of them in existing code, but they really are worth learning. (Erlang is in good company: Python and Haskell have list comprehensions too.)


Back from that tangent and to the program at hand, we see that the macro module names are being stored in an ordinary list. I expect that only a few macro modules (probably only one at most) will be specified in any given module, and looking for an element in a one-element list is pretty quick, so we should not be needing the indexing overhead of a dictionary. I also don't particularly mind if someone specifies a macro module more than once, or if a specified macro module is never used. (If we were really concerned about duplicate macro module names then we could use one of the list module functions to easily remove them.)
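
For instance (a hedged aside, not something EMP2 actually bothers with), lists:usort/1 would both sort the list and drop any duplicate module names in one call:

%% Sketch only: deduplicating the macro module list with lists:usort/1.
Mods = lists:usort(lists:flatten(
    [Ms || {attribute,_Line,macro_modules,Ms} <- AST])).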

We could also roll the gathering of the macro_modules attributes up into the AST-walking code, but conceptually it is nicer to keep it up here and out of the way. Also, as this code only traverses the very top level of the AST it should be quite quick. Pattern-matching one entry per module attribute and function definition is not a computationally expensive task.

Right, the boring stuff is done; let's get into parsing that AST.


As I briefly mused at the bottom of a previous Atomiser post:

Rather than consisting of a bunch of pattern matching clauses, the walk_ast function could be made "smarter" by transforming the given node tuple into a list, and applying some rules-based logic to the elements of that list (from the third element onwards).

I reckon we could give this a go and see where we end up. (Either it will work and we have learned something new, or it won't work and we will have learned something new, so it is a win-win situation either way.)

You might recall that the Atomiser walk_ast function had a clause for each node type. This was a great way for me to implement the Atomiser because I got to see the AST nodes that made up my programs, but in the end it has turned out to be a pretty ugly function.

Here are a few lines of the walk_ast function as a quick refresher (the substitution macro actually makes the code nicer than it could be):

?WALK_AST({call,_Line,_Fun,Args}, Args); % Handles local and module calls.
?WALK_AST({'case',_Line,Test,Clauses}, [Test|Clauses]);
?WALK_AST({'catch',_Line,Expr}, Expr);
?WALK_AST({char,_Line,_Char}, []);

And those clauses go on (and on!) for about forty different node types...

I would much rather only have specific clause for handling each node that we are interested in, and use some generic code to handle the rest. But if we want to create some rules to manage these nodes generically then we had better find some patterns in all of that mess.

...

The first (and blindingly obvious) thing to notice about the nodes is that - without exception - they are all tuples. (I know, I know: I am a genius. Applause is not strictly necessary. Really. Oh, all right then, a little bit of applause is okay, if you insist.)

Two of these tuple nodes are not quite the same as the others: {error, Details} and {warning, Details}. In all of the other nodes the first element is the node type and the second element is the line number of the source file that the node appears in. After that there are a variable number of elements (possibly none) with node-specific meanings.
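
As a concrete illustration, a call such as lists:seq(0, 3) shows up in the AST as a nested tuple whose first two elements follow exactly this pattern (assuming, for the sake of the example, that the call sits on line 12 of the source file):

%% Roughly what the abstract format looks like for lists:seq(0, 3):
{call,12,
    {remote,12,{atom,12,lists},{atom,12,seq}},
    [{integer,12,0},{integer,12,3}]}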

We are interested in catching -macro attributes (so EMP2 can do the work of EMP1) as well as remote function call nodes that are calling a macro function. Everything else is irrelevant, except that we want to recursively descend into sub-nodes to keep searching for other remote macro function calls.

If we take a closer look at the elements of nodes we will note that each element is always either a list, a tuple, or atomic (i.e. an atom or an integer). These elements might have a special meaning to the compiler (depending on their location in the current node tuple), but to us they are just potential sub-nodes. If the node does not match an attribute or remote function call pattern then its elements have no meaning to EMP2 and we can treat them as homogeneous lumps of node matter.

Of the additional elements in a node (if any), they are either
  • a list, which we can parse as its own sub-AST,
  • a tuple, which we can parse as another node, or
  • atomic (an atom, an integer, and so on), which we can pass back as-is.

I think that all of these notes are probably enough to get us started coding.

Thursday 19 April 2007

Erlang Macro Processor (v2), Part I

EMP1 is all well and good, but it does have more than its fair share of idiosyncratic behaviour[1]:

  • EMP1 can only be used to create full functions at the top level of a module. This makes it a bit more difficult to use than strictly necessary, especially if we only want to generate a term to use within a function.
  • Arguments passed to the macro must be literal values - no function calls allowed!
  • Macros must be defined in a separate module, which must be compiled before the macro-using module is compiled.
Quite frankly that first point bugs the hell out of me. I really should not have to write a macro that returns an entire function definition if I only need to generate a small portion of a function.

Today we will begin to tackle this issue with EMP2, but before we dive straight into the parse_transform code I would like to spend a few moments updating our example code. The rewrite will make the example_macro.erl and example.erl modules use the as-yet-unwritten EMP2 module functionality. I probably won't explicitly show it in these posts, but the compile errors I get from running EMP2 over example.erl will have a big influence over the direction that its development takes.


We will still need a separate macro module, but the macro function will only generate the lookup table itself rather than return a whole function definition:

-module(example_macro).
-export([lookup_binary/1]).

lookup_binary(Size) ->
    [[$,,FirstVal]|NumberString] = lists:map(
        fun(Offset) -> io_lib:format(",~B", [Offset * 2]) end,
        lists:seq(0, Size - 1)),
    "<<" ++ [FirstVal] ++ NumberString ++ ">>".


We have lost the code that produces the whole function and only kept our lookup binary creation function, which also seems to have picked up a jaunty little Size argument from somewhere. As before, each element's value is twice its offset (modulo 256: we are only storing bytes after all).

To check that the new macro code works correctly:

1> CL = fun(F) -> c(F), l(F) end.
#Fun
2> CL(example_macro).
{module,example_macro}
3> M = fun(R) -> io:format("~s~n", [lists:flatten(R)]) end.
#Fun
4> M(example_macro:lookup_binary(4)).
<<0,2,4,6>>
ok
5>


And we also have to rewrite the module that calls this lookup macro:

-module(example).
-export([lookup/1]).

-compile({parse_transform, emp2}).
-macro_modules([example_macro]).

lookup(Offset) ->
    <<_:Offset/binary,Value:8/integer,_/binary>> =
        example_macro:lookup_binary(4),
    Value.



This does look a lot nicer than the EMP1 version. Only the snippet of code that needs to be dynamically generated is in the macro module; the rest of the code is in the standard module where it belongs, and the macro call is in a much more appropriate place - inside the function that uses it - than lurking within a module attribute.

With EMP1 we had to peek inside another module to see that a lookup/1 function was being generated, but here we can see that fact already in front of us. We can even guess that a binary term will be created just from the context around the macro call.

Note that 'emp1' has changed to 'emp2' in the parse_transform compiler directive, and that we need a new 'macro_modules' module attribute to tell EMP2 which remote function calls are to be expanded at compile-time.

Once we have written EMP2 and compiled all the modules, we should be able to run the lookup function and receive the same results as we did before:

1> lists:map(fun(N) -> example:lookup(N) end, lists:seq(0, 3)).
[0,2,4,6]
2>



We shall see.


[1] And I cannot have all that competition floating around out there, you know.

Tuesday 17 April 2007

The Atomiser, Redux

I have received some great comments and suggestions regarding the Atomiser; as a result I have added a new (optional) feature to the module. (Don't worry - The Atomiser may be new and improved, but is still 100% backwardly-compatible!)


As usual, you may specify a list of globally-valid atoms:

-atoms([atom1, atom2...]).

You may now also specify function-specific atom lists in two ways. The first method is to add a function name (only) to an atoms declaration entry. The atoms specified will then be valid within all 'fun_name' functions, regardless of the arity of those function definitions:

-atoms({fun_name, [atom1, atom2...]}).

(Unfortunately we have to wrap all this information up in a single tuple: 'wild' module attributes can only contain one value.)

To be even more specific you may add a function name and an arity to an atoms declaration. These atoms will then be valid within that specific 'fun_name/arity' function definition:

-atoms({fun_name, arity, [atom1, atom2...]}).

Atoms declarations are cumulative: globally-valid atoms (if any) are included along with function and function/arity atoms when checking for valid atoms within a given function definition.
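
As a usage sketch (the function names here are hypothetical, chosen only to show the three forms side by side):

-atoms([ok, error]).             % valid in every function of the module
-atoms({handle, [timeout]}).     % valid in handle/1, handle/2, ... (any arity)
-atoms({handle, 2, [retry]}).    % valid only within handle/2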


You might notice that in the code below I have added a few new clauses into the walk_ast function. I was a bit concerned that I may have missed some node types from the Erlang AST, so I cracked open the only reference I had seen of the Abstract Format and added a few more function clauses that I had initially overlooked. I am pretty sure that just about everything is in there now, but feel free to disabuse me of that notion. :-)


Finally, I cleaned up the ?WALK_AST macro a little so that it no longer requires a list of ASTs to process: it now works directly off a single AST. Removing embedded lists has simplified the use of this macro quite considerably.


The new Atomiser Module:


-module(atomiser).
-author("Philip Robinson").
-vsn('1.1.1').
-export([parse_transform/2]).
%-compile({parse_transform, atomiser}). % Uncomment after initial compile.

-atoms([base_dict_key,error, ok]). % Atoms used in four or more functions.
-atoms({atoms_check, 5, [found]}).
-atoms({atoms_unused_print, 1, [found]}).
-atoms({key_more_general, 1, [function]}).
-atoms({parse_transform, 2, [report_warnings,true]}).
-atoms({walk_ast, 3, [atom, atoms, attribute, b_generate, bc, bin, bin_element,
        block, call, 'case', 'catch', char, clause, clauses, cons, eof, float,
        'fun', function, generate, 'if', integer, lc, match, nil, op, 'query',
        'receive', record, record_field, string, 'try', tuple, var, warning]}).

parse_transform(AST, Options) ->
    DictAtomsAll = dict:store(base_dict_key, dict:new(), dict:new()),
    case lists:member(report_warnings, Options) of
        true -> atoms_unused_print(walk_ast(AST, base_dict_key, DictAtomsAll));
        _ -> ok
        end,
    AST.

dict_with_added_atoms(Line, AtomList, DictInitial) ->
    AddAtom = fun(Atom, Dict) ->
        case dict:find(Atom, Dict) of
            {ok,LineAlreadyDefined} ->
                io:format(
                    "~s:~B Warning: atom '~w' already defined on line ~B.~n",
                    [?FILE, Line, Atom, LineAlreadyDefined]),
                Dict;
            error -> dict:store(Atom, Line, Dict)
            end
        end,
    lists:foldl(AddAtom, DictInitial, AtomList).

atoms_from_attr(Line, Key, AtomList, Atoms) ->
    Dict = case dict:find(Key, Atoms) of {ok,D} -> D; error -> dict:new() end,
    dict:store(Key, dict_with_added_atoms(Line, AtomList, Dict), Atoms).

atoms_check(Atom, Line, KeyDict, Atoms) ->
    case dict:find(KeyDict, Atoms) of
        {ok,Dict} -> atoms_check(Atom, Line, KeyDict, Dict, Atoms);
        error -> atoms_check(Atom, Line, key_more_general(KeyDict), Atoms)
        end.

atoms_check(Atom, Line, KeyDict, Dict, Atoms) ->
    case dict:find(Atom, Dict) of
        {ok,found} -> Atoms;
        {ok,_LineDefinedOn} ->
            dict:store(KeyDict, dict:store(Atom,found,Dict), Atoms);
        error ->
            case KeyDict of
                base_dict_key ->
                    io:format("~s:~B Warning: atom '~w' unexpected.~n",
                        [?FILE, Line, Atom]),
                    Atoms;
                _ -> atoms_check(Atom, Line, key_more_general(KeyDict), Atoms)
                end
        end.

key_more_general({function,Fun,_Arity}) -> {function,Fun};
key_more_general({function,_Fun}) -> base_dict_key.

atoms_unused_print(Atoms) ->
    Filter = fun({_Atom,Line}) -> Line =/= found end,
    DictsToList = fun({_DictKey,Dict}, UnusedAtoms) ->
        UnusedAtomsNew = lists:filter(Filter, dict:to_list(Dict)),
        UnusedAtomsNewSorted = lists:keysort(2, UnusedAtomsNew),
        lists:keymerge(2, UnusedAtomsNewSorted, UnusedAtoms)
        end,
    UnusedAtoms = lists:foldl(DictsToList, [], dict:to_list(Atoms)),
    PrintUnusedAtom = fun({Atom,Line}) ->
        io:format("~s:~B Warning: atom '~w' unused.~n", [?FILE, Line, Atom])
        end,
    lists:foreach(PrintUnusedAtom, UnusedAtoms).

-define(WALK_AST(PatternToMatch, ExpressionsToProcess),
    walk_ast([PatternToMatch|ASTRest], Key, Atoms) ->
        walk_ast(ASTRest, Key, walk_ast(ExpressionsToProcess, Key, Atoms))).

walk_ast([], _Key, Atoms) -> Atoms;
walk_ast([{atom,Line,Atom}|RestAST], Key, Atoms) ->
    walk_ast(RestAST, Key, atoms_check(Atom, Line, Key, Atoms));
walk_ast([{attribute,Line,atoms,{Fun,Arity,AtomList}}|RestAST], Key, Atoms) ->
    AtomsNew = atoms_from_attr(Line, {function,Fun,Arity}, AtomList, Atoms),
    walk_ast(RestAST, Key, AtomsNew);
walk_ast([{attribute,Line,atoms,{Fun,AtomList}}|RestAST], Key, Atoms) ->
    AtomsNew = atoms_from_attr(Line, {function,Fun}, AtomList, Atoms),
    walk_ast(RestAST, Key, AtomsNew);
walk_ast([{attribute,Line,atoms,AtomList}|RestAST], Key, Atoms) ->
    AtomsNew = atoms_from_attr(Line, base_dict_key, AtomList, Atoms),
    walk_ast(RestAST, Key, AtomsNew);
?WALK_AST({attribute,_Line,_Name,_Value}, []); % Ignore all other attributes.
?WALK_AST({b_generate,_Line,Pattern,Expression}, [Pattern, Expression]);
?WALK_AST({bc,_Line,Head,Tail}, [Head|Tail]);
?WALK_AST({bin,_Line,BinElements}, BinElements);
?WALK_AST({bin_element,_Line,_Name,_Size,_Type}, []);
?WALK_AST({block,_Line,Expr}, [Expr]);
?WALK_AST({call,_Line,_Fun,Args}, Args); % Handles local and module calls.
?WALK_AST({'case',_Line,Test,Clauses}, [Test|Clauses]);
?WALK_AST({'catch',_Line,Expr}, Expr);
?WALK_AST({char,_Line,_Char}, []);
walk_ast([{clause,_Line,Pattern,Guards,Body}|RestAST], Key, Atoms) ->
    AtomsGuard = lists:foldl(
        fun(ASTGuard, AtomsGuard) ->
            walk_ast(ASTGuard, Key, AtomsGuard)
            end,
        walk_ast(Pattern, Key, Atoms), Guards),
    walk_ast(RestAST, Key, walk_ast(Body, Key, AtomsGuard));
?WALK_AST({cons,_Line,Left,Right}, [Left,Right]);
?WALK_AST({eof,_Line}, []);
?WALK_AST({error,_Details}, []); % Ignore compiler errors.
?WALK_AST({float,_Line,_Float}, []);
?WALK_AST({'fun',_Line,{clauses,Clauses}}, Clauses);
?WALK_AST({'fun',_Line,_ModuleFunArity}, []);
walk_ast([{function,_Line,Fun,Arity,Clauses}|RestAST], Key, Atoms) ->
    walk_ast(RestAST, Key, walk_ast(Clauses, {function,Fun,Arity}, Atoms));
?WALK_AST({generate,_Line,Pattern,Expression}, [Pattern, Expression]);
?WALK_AST({'if',_Line,Clauses}, Clauses);
?WALK_AST({integer,_Line,_Integer}, []);
?WALK_AST({lc,_Line,Head,Tail}, [Head|Tail]);
?WALK_AST({match,_Line,Pattern,Expression}, [Pattern, Expression]);
?WALK_AST({nil,_Line}, []);
?WALK_AST({op,_Line,_BinaryOp,Left,Right}, [Left, Right]);
?WALK_AST({op,_Line,_UnaryOp,_Operand}, []);
?WALK_AST({'query',_Line,ListComprehension}, [ListComprehension]);
?WALK_AST({'receive',_Line,Clauses}, Clauses);
?WALK_AST({'receive',_Line,Clauses1,_TimeAfter,Clauses2}, Clauses1 ++ Clauses2);
?WALK_AST({record,_Line,_Record,Fields}, Fields);
?WALK_AST({record_field,_Line,Field,Value}, [Field, Value]);
?WALK_AST({record_field,_Line,_Variable,_Record,Field}, [Field]);
?WALK_AST({string,_Line,_String}, []);
?WALK_AST({'try',_Line,Block,CaseClauses,CatchClauses,AfterClauses},
            [Block] ++ CaseClauses ++ CatchClauses ++ AfterClauses);
?WALK_AST({tuple,_Line,Elements}, Elements);
?WALK_AST({var,_Line,_Name}, []);
?WALK_AST({warning,_Details}, []); % Ignore compiler warnings.
walk_ast([Node|ASTRest], Key, Atoms) ->
    io:format("Unknown node: ~p~n", [Node]),
    walk_ast(ASTRest, Key, Atoms).



PS: Does anyone know of an easy way to get Blogger to indent code properly? I am getting a little tired of pasting loads of "&nbsp;" everywhere...

Sunday 15 April 2007

"Dynamic" record access functions with EMP1

Brian Olsen (over at Programming Experiments) wrote a small set of functions to make record accesses/updates in Erlang nicer. Ayrnieu wrote a detailed response to this in a comment on Reddit.

Brian wanted to hide some of the (admittedly pretty ugly) syntax of Erlang records in a simple way. He used some run-time list-searching to find the position in the record tuple that a particular field name occurs at, and then located the desired value at that position.
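
For context, the run-time idea looks roughly like this (my own sketch under my own naming, not Brian's actual code): search the field-name list for the field's position, then index into the record tuple.

recval_runtime(Field, Fields, Record) ->
    %% Linear search for the field's position on every call;
    %% element 1 of the record tuple is the record name tag, hence the +2.
    Index = length(lists:takewhile(fun(F) -> F =/= Field end, Fields)),
    element(Index + 2, Record).

Every call pays for that list traversal, which is exactly the overhead a compile-time approach can avoid.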


Now that we have EMP1 working I thought that perhaps I might see how I would use this particular tool to solve the same problem.


First of all we need to figure out what the functions we want should look like. I think something like this would do nicely:

recval(FieldName, Record) -> Value.
setrecval(FieldName, Record, Value) -> Updated Record.


Of course under the covers recval and setrecval would examine the record given and work out which field to retrieve / update.

Both Brian and Ayrneiu have this work done at run-time. With EMP1 we can build the supporting functions at compile-time based on the record information (which is already known at compile-time).

In detail, recval and company would look something like this:

recval(FieldName, Record) -> recval(element(1, Record), FieldName, Record).
recval(record1, field1, Record) -> element(2, Record);
recval(record1, field2, Record) -> element(3, Record);
recval(record2, field1, Record) -> element(2, Record);
...


...and similarly for the setrecval versions.

These functions can all be created at compile-time with EMP1, like this:

-module(dyrec_macro).
-export([recval_generate/1]).

recval_field(NameRecord, NameField, Posn) ->
    io_lib:format("recval(~w, ~w, Record) -> element(~B, Record)",
        [NameRecord, NameField, Posn]).

setrecval_field(NameRecord, NameField, Posn) ->
    io_lib:format(
        "setrecval(~w, ~w, Record, Value) -> setelement(~B, Record, Value)",
        [NameRecord, NameField, Posn]).

recval_record(RecordDetails) -> recval_record(RecordDetails, 2, []).
recval_record({_NameRecord, []}, _Posn, Text) -> Text;
recval_record({NameRecord, [NameField|NameFieldsRest]}, Posn, Text) ->
    recval_record({NameRecord, NameFieldsRest}, Posn + 1,
        Text ++ "; " ++ recval_field(NameRecord, NameField, Posn)).

setrecval_record(RecordDetails) -> setrecval_record(RecordDetails, 2, []).
setrecval_record({_NameRecord, []}, _Posn, Text) -> Text;
setrecval_record({NameRecord, [NameField|NameFieldsRest]}, Posn, Text) ->
    setrecval_record({NameRecord, NameFieldsRest}, Posn + 1,
        Text ++ "; " ++ setrecval_field(NameRecord, NameField, Posn)).

recval_generate(ListRecordDetails) ->
    [$;,32|CodeGet] = lists:flatten(
        lists:map(fun(E) -> recval_record(E) end, ListRecordDetails)),
    [$;,32|CodeSet] = lists:flatten(
        lists:map(fun(E) -> setrecval_record(E) end, ListRecordDetails)),
    "recval(Field, Record) -> recval(element(1, Record), Field, Record). "
    "setrecval(Field, Record, Value) -> "
        "setrecval(element(1, Record), Field, Record, Value). " ++
        io_lib:format("~s. ~s.", [CodeGet, CodeSet]).



And here is a test program:

-module(dyrec_test).
-export([start/0]).
-compile({parse_transform, emp1}).

-record(data1, {this, that}).
-record(data2, {this, the_other}).

-macro({dyrec_macro, recval_generate,
    [[{data1, [this, that]}, {data2, [this, the_other]}]]}).

start() ->
    D1 = #data1{this=a, that=b},
    D2 = #data2{this=c, the_other=d},
    D3 = setrecval(this, D1, e),
    io:format("~p~n~p~n~p~n~p~n~p~n",
        [recval(this, D1), recval(that, D1),
         recval(this, D2), recval(the_other, D2),
         D3]).



After compiling both of them, we can run this at the REPL:

1> dyrec_test:start().
a
b
c
d
{data1,e,b}
ok
2>



Personally I would not use EMP1 for this (particular) purpose. I do not mind Erlang's record syntax, but if I really did not want to use it I would rather build a parse-transformation (a la Yariv's recless module) to convert a different syntax into the record tuples Erlang uses behind the scenes.

By layering function calls on top of record/tuple field accesses we destroy the ability of Erlang's compiler to convert the usual record syntax into direct tuple element lookups at the point of reference. With this approach the runtime now has to perform a pattern match on record and field names before finding the appropriate value. (Possibly this overhead could be removed by the use of the compiler's 'inline' option, though.)
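
If you wanted to experiment with that, the attribute would look something like this (a hedged sketch: the arities match the recval/setrecval functions generated by the EMP1 example above, and I have not measured whether it actually recovers the lost performance):

%% Ask the compiler to inline the generated accessor functions.
-compile({inline, [{recval,2}, {recval,3}, {setrecval,3}, {setrecval,4}]}).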

So my verdict on this jaunt into using EMP1 for layering function calls on record accesses, is "certainly possible, but not necessarily practical". Wait for EMP2 and use that instead. :-)

Friday 13 April 2007

Erlang Macro Processor (v1), Part IV

We know what we want, we know how we want to use it, and so without further ado, here it is: the code for EMP1.

-module(emp1).
-author("Philip Robinson").
-vsn('1.0').
-export([parse_transform/2]).

parse_transform(ASTIn, _Options) -> walk_ast(ASTIn, []).

walk_ast([], ASTOut) -> lists:reverse(ASTOut);
walk_ast([{attribute,Line,macro,{Mod,Fun,Args}}|RestASTIn], ASTOut) ->
    ReversedResults =
        ast_reversed_results(
            lists:flatten([apply(Mod,Fun,Args)|[" "]]),
            Line, []),
    walk_ast(RestASTIn, ReversedResults ++ ASTOut);
walk_ast([Node|ASTInRest], ASTOut) -> walk_ast(ASTInRest, [Node|ASTOut]).

ast_reversed_results(ResultsString, LineStart, ASTResults) ->
    case string_trim_whitespace(ResultsString) of
        "" -> ASTResults;
        String ->
            {done,{ok,Tokens,LineEnd},StringRest} =
                erl_scan:tokens([], String, LineStart),
            {ok, AST} = erl_parse:parse_form(Tokens),
            ast_reversed_results(StringRest, LineEnd, [AST|ASTResults])
        end.

string_trim_whitespace([ 9|String]) -> string_trim_whitespace(String);
string_trim_whitespace([10|String]) -> string_trim_whitespace(String);
string_trim_whitespace([32|String]) -> string_trim_whitespace(String);
string_trim_whitespace( String ) -> String.



And that is it - 30 lines of code.


No, really. That is all there is. I can take you through it in some detail, if you want. Fasten your seat-belts.


EMP1 In Detail

I will start with the "mostest ugliest" piece of Erlang code I have ever written: the string_trim_whitespace function.

This function returns the given string minus any leading tabs, carriage returns, or spaces. I searched the Erlang documentation and the Trap Exit website but I did not manage to find any built-in functions that achieved the same goal. Four lines of code seems a bit excessive for what it actually does and I am sure there must be a nicer way of writing it.
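
One marginally tidier variant (the same behaviour, purely a matter of taste) collapses the three whitespace clauses into a single guard:

%% Equivalent single-clause version using a guard over the whitespace set.
string_trim_whitespace([C|String]) when C =:= $\t; C =:= $\n; C =:= $\s ->
    string_trim_whitespace(String);
string_trim_whitespace(String) -> String.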

This function is actually a reasonably good example of Erlang pattern-matching and tail-recursion at work. If the given string begins with a tab (ASCII 9), carriage return (ASCII 10), or a space (ASCII 32), then it will match one of the first three function clauses. The first character will be dropped and the function recursively called with the rest of the string.

If the string does not match any of those three function clauses then it must not have a tab, carriage return, or space at the beginning, so the given string is returned as-is. This even works for the empty string. (Technically it would also match any non-string argument - integer, float, tuple, or whatever - and just return the input given.)

Even though the function uses recursion there is no danger of the stack blowing out no matter how large the string is. Erlang (like most functional languages) has a neat trick of turning tail recursion calls into a goto loop so the function executes in constant memory space. Others have explained tail-recursion better than I can, so let's move on...


Next on the list is the walk_ast function, which runs through the top level of the inbound AST and builds an outbound AST. The outbound AST list is built in reverse order to take advantage of the cheap list 'cons' operation: it is very quick to add or remove an element at the beginning of a list but much more expensive to add or remove an element at the end of a list. When the whole inbound AST has been processed (the argument matches the empty list) then the outbound AST is run through the lists:reverse function to switch it back to the right-way-around order again. If you are not yet familiar with this build-in-reverse-then-switch idiom, you soon will be. :-)
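
For readers new to the idiom, here it is in miniature (a throwaway example, nothing to do with EMP1 itself):

%% Build the result in reverse (cheap cons), then reverse it once at the end.
double_all(List) -> double_all(List, []).

double_all([], Acc) -> lists:reverse(Acc);
double_all([X|Rest], Acc) -> double_all(Rest, [X * 2|Acc]).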

There are only three function clauses in this walk_ast function: The final case where we reverse and return the new AST, processing a 'macro' module attribute, and everything else.

The 'final' case I have covered above, and the 'everything else' case just passes the node straight to the outbound AST. The magic of EMP1 happens in the macro module attribute clause.

The walk_ast function looks for macro attributes of this form:

-macro({Module, Function, Args}).

When it finds a match it calls the module/function with the given args and captures the result, which should be a string representation of an Erlang function. It adds this string to the beginning of a list containing a single space and then flattens the total result.

A space is added to the end of the return string because erl_scan:tokens has a problem parsing something like "42." - it cannot tell if this is the beginning of a floating-point number. To avoid this I add a space to the end of the string; erl_scan:tokens knows that "42. " is just the integer 42.
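
The behaviour described above is easy to check in the shell (a sketch; the continuation, token, and location details are elided):

%% Without the trailing space the scanner cannot decide whether "42." starts
%% a float, so it asks for more input; with the space the form completes.
{more, _Continuation} = erl_scan:tokens([], "f() -> 42.", 1),
{done, {ok, _Tokens, _EndLine}, _Rest} = erl_scan:tokens([], "f() -> 42. ", 1).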

The resulting string is also flattened because io_lib:format does some funny things when you use "~s" to embed a value string into a format string. For example, io_lib:format("ab~se", ["cd"]) produces [97,98,"cd",101] instead of an expected (in my opinion) "abcde". This might be okay for printing, which I presume flattens its input as it goes, but this is a terrible format for erl_scan to tokenise.

Once the macro's return string has been mutilated enough it is passed on to ast_reversed_results, for some further mangling.

The ast_reversed_results function does pretty much all the heavy lifting for the module. It takes in the current result string (a flattened text representation of one or more Erlang functions with a space at the end), the line on which the module attribute was declared, and the current AST list of processed results (in reversed order, as per the functional programming idiom mentioned above).

The very first thing this function does is to strip leading whitespace characters from the input string, and test that result against the empty string.

For some reason erl_scan:tokens returns a {more, SomeWeirdStuff} tuple when it is handed a string of whitespace characters (and also when given the empty string). I have no idea what I should do with this result so I strip the leading whitespace characters out and test for the empty string instead.

If the stripped string is not empty then we want to tokenise and parse the first form (which should be a function definition), add the parsed results to the beginning of our AST list, and try again with the rest of the string (as it is possible to include more than one function definition in the macro return string).

If the stripped string is empty then there is nothing left to process and we can return the (reversed) AST of results. We keep these in reversed order because they are simply prepended to walk_ast's ASTOut, and the whole thing will be re-reversed at the end.


Whew!



EMP1 Epilogue and Notes

* An interesting 'feature' of EMP1 is that it may be used to create functions where the function name is programmatically generated. I am not sure why you might choose to create a whole bunch of separate, named functions over, say, creating one function with multiple clauses triggered by an atom argument, but EMP1 certainly makes it possible.
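
A toy illustration of the point (a hypothetical macro, not one used anywhere in this series): the function names below are themselves generated from data.

%% Returns three separately named functions: lookup_small/0, lookup_medium/0,
%% and lookup_large/0.
sized_lookups() ->
    lists:flatten(
        [io_lib:format("lookup_~s() -> ~B. ", [Name, Size])
         || {Name, Size} <- [{"small",10}, {"medium",100}, {"large",1000}]]).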

* I would recommend avoiding carriage returns in macro output strings. It does not actually break anything, but it tends to obfuscate the stack trace output of any runtime exceptions thrown from the generated code.

* One advantage of compile-time macros over run-time function-building techniques is that the usual compiler checks are run over the generated code. (The macro-created code is actually there at compile-time rather than appearing later at run-time.) I like to get my bug reports early, and if the compiler can complain then I don't need to wait for unit tests to raise an issue.

Using compile-time macros also means that static code analysis tools such as the Dialyzer will include the generated functions in its analysis and report.

There are, however, situations where not all of the information needed to create a function is available at compile-time. If you find yourself in such a predicament you might want to check out Yariv's smerl project, which makes it a lot easier to do runtime meta-programming.

I might need to use smerl when I write EMP2.

Thursday 12 April 2007

Erlang Macro Processor (v1), Part III

EMP1 in Action
(How will I use this thing?)



A macro function must already be compiled before EMP1 can call it, so I will probably have to create an "example_macro.erl" module for each "example.erl" module, as needed.

The macro modules will simply export functions that return strings. Ideally the returned strings will be valid Erlang functions... if they are not valid Erlang functions then we will get to watch the pretty fireworks from the compiler. :-)


To generate a lookup function I might have a macro module something like this:

-module(example_macro).
-export([lookup_generate/0]).

lookup_generate() ->
    io_lib:format(
        "lookup(Offset) -> "
            "<<_:offset/binary,Value:8/integer,_/binary>> = <<~s>>, "
            "Value.",
        [lookup_table()]).

lookup_table() ->
    [[$,,FirstVal]|NumberString] =
        lists:map(
            fun(Offset) ->
                io_lib:format(",~B", [Offset * 2])
                end,
            lists:seq(0, 3)),
    lists:flatten([FirstVal|NumberString]).


In this example the lookup table generated has only four elements; each element's value in the table is just its offset multiplied by two.

If we were to compile and run example_macro:lookup_generate() directly we would see a whole string of numbers... to actually see what was produced in a readable format we will want to do something like this:

1> io:format("~s~n", [lists:flatten(example_macro:lookup_generate())]).

...actually, we probably want to do this instead, to save typing:

1> M = fun(R) -> io:format("~s~n", [lists:flatten(R)]) end.
#Fun
2> M(example_macro:lookup_generate()).
lookup(Offset) -> <<_:Offset/binary,Value:8/integer,_/binary>> = <<0,2,4,6>>, Value.
ok
3>


Hey, that even looks like a valid function... not bad as a proof of concept, but I really hope no-one uses a macro for a table that small!


Here is the module that I want this lookup function to be created in:

-module(example).
-compile({parse_transform, emp1}).
-export([lookup/1]).

-macro({example_macro, lookup_generate, []}).


Note that the export directive is referring to a function that does not actually exist in the source file. This function will be generated at compile-time by EMP1 calling example_macro:lookup_generate() and inserting the parsed results into the compiled code.

We cannot compile this file just yet because we have not implemented EMP1, but hopefully we will soon be able to do this:

1> lists:map(fun(N) -> example:lookup(N) end, lists:seq(0, 3)).
[0,2,4,6]
2>


Hopefully...

Erlang Macro Processor (v1), Part II

I will have to be quite stingy with doling out feature requests if I want to keep this parse_transform module as small - and simple - as possible. EMP2 may be able to guzzle from the feature fountain; EMP1 will be on restricted rations.

I really only need EMP1 for a single specific purpose: to create a binary table in a lookup function at compile-time. Anything extra it can do will be icing on the cake.


There is a bare minimum that EMP1 needs to do in order to qualify as a compile-time macro processor:

  1. Walk the AST of a given module.
  2. On encountering a macro 'trigger', run the designated macro function and retrieve its output.
  3. Parse the macro function's output and convert it into an AST of the Erlang term/s.
  4. Replace the macro trigger in the original AST with the AST of the function's output.
So how can we achieve this with a minimum of effort?


We already know how to walk the AST of a module from our work on the Atomiser, so this part should not be a problem.

The simplest macro 'trigger' I can think of is having '-macro' module attributes contain a {module,function,args} tuple value. These attributes are easy enough to find, and we can use the Erlang 'apply' function to call an arbitrary function.
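
Concretely (a fragment for illustration, assuming AttributeNode is bound to the attribute's AST node; the attribute itself is the one used in the example module later on), the trigger and its expansion look like this:

%% The source attribute -macro({example_macro, lookup_generate, []}) appears
%% in the AST as {attribute, Line, macro, {example_macro, lookup_generate, []}},
%% and EMP1 runs the macro with apply/3:
{attribute, _Line, macro, {Mod, Fun, Args}} = AttributeNode,
ResultString = apply(Mod, Fun, Args).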

By only using module attributes as macro triggers the EMP1 module only has to look for (and replace) nodes at the top level of the given AST. EMP1 does not have to walk through (or even know about) any other nodes, so almost all of the walk_ast function clauses that had to be included in the Atomiser can be left out of this module.

The easiest way to implement macro functions is probably for them to return a textual representation of the function that we want them to create. This will actually make it much easier to debug the macros as well: we will be able to print the output and see if it really is the function we are after.

Replacing a macro trigger module attribute node in an AST is as simple as prepending a different node to the outbound AST in the walk_ast function.

The only tricky bit with all of this would be parsing the textual representation of an Erlang term and converting it into an AST, but I am sure there will be something available for this purpose. I have seen some erl_scan and erl_parse modules floating around in the documentation - they or something similar should do.
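
They do the job: a single form can be turned from text into its abstract form with a scan-then-parse pair, roughly like this (a sketch of the idea only; the real module also has to loop over multiple forms and cope with whitespace, as we will see):

%% Sketch: text of one complete form -> tokens -> abstract form.
{done, {ok, Tokens, _EndLine}, _Rest} =
    erl_scan:tokens([], "double(X) -> X * 2. ", 1),
{ok, Form} = erl_parse:parse_form(Tokens).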


On the other hand, there are a few caveats that I can think of with implementing EMP1 this way.

As EMP1 is a parse_transform module, it would not be possible to use it to define a standard Erlang substitution macro (using the '-define' keyword). Normal Erlang macros are already expanded into the AST before EMP1 receives it; as far as I am aware there is no such thing as an AST for a substitution macro. I am not sure if there is anything that can be done about this, unless there is some way of hooking directly into the Erlang preprocessor.

The biggest limitation with using module attributes as the trigger to call a macro is that EMP1 will only be able to build whole functions at a time. Module attributes [only] live at the top level of the AST, and you cannot just insert any old Erlang term up there.

For example, to create a binary lookup table in a function a macro will have to return the whole function with the table embedded rather than just the table by itself. In my case the lookup function will be quite small anyway, so this does not bother me overly much.

Finally, EMP1 will be executing code while the module is being compiled. Any macro code that is to be run must be in a separate (and already compiled) module, which will introduce a dependency in the order in which certain modules can be compiled.

Do any of these limitations worry me?

Hell no!

If EMP1 can achieve the modest goals outlined above then I will consider this project a resounding success.

Wednesday 11 April 2007

Erlang Macro Processor (v1), Part I

Lisp macros. Sigh.

The power at your disposal when you can write code to generate code is... truly awesome.

Unfortunately (as far as my macrophilia is concerned) I happen to prefer the open-source Erlang/OTP implementation far more than any Lisp I have encountered.

I had sullenly resigned myself to a life filled with mere substitution macros (cue the violins, please) when, like an oyster presented with a piece of grit, I was faced with an irritating problem.


It was a dark and moonless night. Quietly the function gained mass as the minutes inched past midnight and into the early dawn. Its time-consuming mathematical calculation greedily drained CPU cycles every time it was run, and finally, a lightning-quick strike of doom: a simple call vampirically inserted into an inner loop cemented the function's status as the program's major bottleneck. You could almost hear the evil cackle mocking the developer's feeble resistance.


Naturally the first thing I attempted as a cure to this disease was a bit of memoization to cache this function's results. Unfortunately this approach actually had the exact opposite effect to the one I intended... I suspect that the sheer number of key/value pairs held was causing the dictionary to swap to disk; there was an awful lot of hard drive thrashing going on after I made that change.

The additional fact that this routine had a known, contiguous, sequential input range bugged me too. Why should I need to put up with all of the indexing overhead of storing a dictionary of keys and values, when I only really needed a binary array of values to access by (key) offset?

The next obvious step would have been to create a start function to initialise a binary table with the needed results. I did not like this option very much, though, mainly because I would have had to manage the initialised table by either

  • passing the binary as an argument to every function that needs it (or calls a function that needs it, ad infinitum), or
  • storing the binary in the process dictionary (a measure of last resort in a functional program), or
  • creating a separate process to manage the binary and respond to message requests for its contents, a far more complicated solution than I desired.
Another minor consideration was that an initialisation routine would slow down the beginning of my program every time it was run, even if I was only going to run the program on a small range of test values. This might not be so important during production (barring crash/restart situations), but it can definitely slow down an iterative development cycle.


What I really, really wanted to do was create the lookup binary at compile-time and embed it directly in the calculation function as a literal term. Just the sort of thing I would have done automatically in Lisp, without it even registering as a problem.


"Now," the developer says out loud, scratching his head so the audience knows that he is deep in thought, "What can I do to examine the Abstract Syntax Tree of a program and replace portions of it with new code?"

(cue fanfare)

"This is a job for parse_transform!"

(Programmers are strongly advised not to engage in parse transformations and no support is offered for problems encountered.)

Sunday 8 April 2007

The Atomiser, Part VII

As promised, here is the full listing of my current atomiser.erl file:

-module(atomiser).
-author("Philip Robinson").
-export([parse_transform/2]).
%-compile({parse_transform, atomiser}). % Uncomment after initial compile.

-atoms([atom, attribute, bin, bin_element, call, 'case', char]).
-atoms([clause, clauses, cons, eof, 'fun', function, generate]).
-atoms(['if', integer, lc, match, nil, op, 'receive', record]).
-atoms([record_field, remote, string, tuple, var]).
-atoms([atoms, error, found, ok]).

parse_transform(AST, _Options) ->
    atoms_unused_print(walk_ast(AST, dict:new())),
    AST.

atoms_from_attribute(Line, AtomList, Atoms) ->
    AddAtom = fun(Atom, Dict) ->
        case dict:find(Atom, Dict) of
            {ok, LineAlreadyDefined} ->
                io:format("Line ~B: Atom ~w already defined on line ~B.~n",
                    [Line, Atom, LineAlreadyDefined]),
                Dict;
            error -> dict:store(Atom, Line, Dict)
            end
        end,
    lists:foldl(AddAtom, Atoms, AtomList).

atom_check(Atom, Line, Atoms) ->
    case dict:find(Atom, Atoms) of
        {ok, found} -> Atoms;
        {ok, _LineDefinedOn} -> dict:store(Atom, found, Atoms);
        error ->
            io:format("Line ~B: Atom ~w unexpected.~n", [Line, Atom]),
            Atoms
        end.

atoms_unused_print(Atoms) ->
    Filter = fun({_Atom, FoundOrDefinedLine}) ->
        FoundOrDefinedLine =/= found
        end,
    PrintUnusedAtom = fun({Atom, Line}) ->
        io:format("Line ~B: Atom ~w unused.~n", [Line, Atom])
        end,
    lists:foreach(PrintUnusedAtom,
        lists:keysort(2, lists:filter(Filter, dict:to_list(Atoms)))).

-define(WALK_AST(Pattern, Expressions),
    walk_ast([Pattern|ASTRest], Atoms) ->
        Fun = fun(AST, AtomsMarked) ->
            walk_ast(AST, AtomsMarked)
            end,
        walk_ast(ASTRest, lists:foldl(Fun, Atoms, Expressions))).

walk_ast([], Atoms) -> Atoms;
walk_ast([{atom,Line,Atom}|RestAST], Atoms) -> % Check whether atom is valid.
    walk_ast(RestAST, atom_check(Atom, Line, Atoms));
walk_ast([{attribute,Line,atoms,AtomList}|RestAST], Atoms) -> % Valid atoms.
    walk_ast(RestAST, atoms_from_attribute(Line, AtomList, Atoms));
?WALK_AST({attribute,_Line,_Name,_Value}, []);
?WALK_AST({bin,_Line,Elements}, [Elements]);
?WALK_AST({bin_element,_Line,_Name,_Size,_Type}, []);
?WALK_AST({call,_Line,_Fun,Args}, [Args]);
?WALK_AST({'case',_Line,Test,Clauses}, [[Test], Clauses]);
?WALK_AST({char,_Line,_Char}, []);
?WALK_AST({clause,_Line,Args,Guards,Exprs}, [Args] ++ Guards ++ [Exprs]);
?WALK_AST({cons,_Line,Head,Tail}, [[Head], [Tail]]);
?WALK_AST({eof,_Line}, []);
?WALK_AST({error,_Details}, []); % Ignore compiler errors.
?WALK_AST({'fun',_Line,{clauses,Clauses}}, [Clauses]);
?WALK_AST({function,_Line,_Fun,_Arity,Clauses}, [Clauses]);
?WALK_AST({generate,_Line,A,B}, [[A, B]]);
?WALK_AST({'if',_Line,Clauses}, [Clauses]);
?WALK_AST({integer,_Line,_Integer}, []);
?WALK_AST({lc,_Line,Head,Tail}, [[Head|Tail]]);
?WALK_AST({match,_Line,Left,Right}, [[Left], [Right]]);
?WALK_AST({nil,_Line}, []);
?WALK_AST({op,_Line,_BinaryOperator,Left,Right}, [[Left], [Right]]);
?WALK_AST({op,_Line,_UnaryOperator,_Operand}, []);
?WALK_AST({'receive',_Line,Clauses}, [Clauses]);
?WALK_AST({'receive',_Line,Clauses1,_TimeAfter,Clauses2}, [Clauses1, Clauses2]);
?WALK_AST({record,_Line,_Record,Fields}, [Fields]);
?WALK_AST({record_field,_Line,Field,Contents}, [[Field,Contents]]);
?WALK_AST({record_field,_Line,_Variable,_Record,Field}, [[Field]]);
?WALK_AST({remote,_Line,_Module,_Function}, []);
?WALK_AST({string,_Line,_String}, []);
?WALK_AST({tuple,_Line,Elements}, [Elements]);
?WALK_AST({var,_Line,_Name}, []);
walk_ast([Node|ASTRest], Atoms) ->
    io:format("Unknown node: ~p~n", [Node]),
    walk_ast(ASTRest, Atoms).



Some final notes, in no particular order:

I am quite pleased with the functionality of the Atomiser, especially considering that it currently weighs in at just under 100 lines of code. I can honestly attribute the relatively small size of this module to the use of the single WALK_AST substitution macro. If this macro had not been used then we would be looking at an increase of 50% in lines of code, at least.

The fact that the Atomiser does not alter the parse tree of the program it is examining made it an ideal project to get used to working with Erlang parse_transform programs. Without support for parse_transform modules I would have had to hack at the source code of the compiler to achieve a similar result... which is not really a viable option if the addition is not accepted into the project.

Obviously in this implementation I have only added walk_ast function clauses for those AST nodes my own programs require; your mileage may vary. If you do run this module over your own code and an unknown node appears then please let me know so I can add an appropriate clause here. Likewise, please feel free to drop me a line with questions, comments, and/or (especially!) suggestions.


Update 10/4/2007:

Yariv Sadan suggested that I take a look at Recless, one of his many ongoing projects (see comments in Part 1). As Yariv did in Recless, I have removed the atoms_from_ast function by rolling the gathering of atoms into the walk_ast function. It saves five lines of code, but more importantly it saves one pass through the top level of the AST. (The down side is that you can no longer expect the Atomiser to validate atoms before the appropriate module attribute is encountered, but I have no problem with that.)
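
For reference, the top-level parse_transform function ends up looking something like this after that change (a sketch of my own based on the clauses listed above; the exact wiring, in particular the call to atoms_unused_print, is an assumption rather than a listing from the real module):

parse_transform(AST, _Options) ->
    %% Sketch only: gather and check atoms in a single walk of the AST,
    %% then report any valid atoms that were never used.
    AtomsMarked = walk_ast(AST, dict:new()),
    atoms_unused_print(AtomsMarked),
    AST.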

"ayrnieu" posted a link to this series on Reddit, and also mentioned the Dialyzer tool. I have just played with the Dialyzer and have only one thing to say: Use the Dialyzer on your code.

The Atomiser, Part VI

We still have one (optional) requirement left for the Atomiser:

* Have the Erlang compiler tell us if an atom in the valid list is not used.

This can be done quite easily: the atom_check function returns an updated dictionary whenever an atom has been found. All we have to do is go through the valid atoms dictionary and pick up all the atom keys that do not have the value 'found' associated with them, and print them out in line-number order.

atoms_unused_print(Atoms) ->
    Filter = fun({_Atom, FoundOrDefinedLine}) ->
        FoundOrDefinedLine =/= found
        end,
    AtomsUnused = lists:keysort(2, lists:filter(Filter, dict:to_list(Atoms))),
    PrintUnusedAtom = fun({Atom, Line}) ->
        io:format("Line ~B: Atom ~w unused.~n", [Line, Atom])
        end,
    lists:foreach(PrintUnusedAtom, AtomsUnused).
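

As a purely hypothetical example (the atom name and line number are invented): if an atom declared in an -atoms attribute on line 5 of a module is never used anywhere in that module, the report would read something like:

Line 5: Atom atom4 unused.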



Incidentally, another requirement raised its ugly head during the development of this module. If you ever use the Atomiser when compiling an invalid program, you will receive a horrible 'error' node in the AST. The Atomiser is not interested in these nodes (they will stop the compile process later anyway) and so we just want to ignore them.

A new specialised function clause for walk_ast takes care of this:

?WALK_AST({error,_Details}, []);


Believe it or not, the Atomiser is now pretty much complete.

If you have managed to stay with me so far on this meandering journey, you will probably have noticed that there are a few more minor details to clean up before we are finished. The missing pieces are just a few more atoms that need to be identified in the atoms module attributes, and a couple of extra function clauses that need to be added to walk_ast. I will leave both of these as an exercise for any impatient readers who cannot wait until the next post, when I will list the entire module in all its naked glory.

The Atomiser, Part V

The Atomiser can now walk its own Abstract Syntax Tree and pluck lists of valid atoms from its own source code.

Next on the list is to

* Have the Erlang compiler tell us when we use an atom that is not in the valid list.

Sounds easy: let's get to it!


We want to turn our placeholder atom clause in the walk_ast function into something a little more useful:

walk_ast([{atom,Line,Atom}|RestAST], Atoms) ->
    walk_ast(RestAST, atom_check(Atom, Line, Atoms));



Our new atom_check function will compare the supplied atom against the given atoms dictionary. If the atom exists in the dictionary then a new dictionary will be returned, with that atom's value updated to 'found' (taking the place of the atom's line-of-definition integer). If the atom does not exist in the dictionary then a warning message will be displayed and the original dictionary will be returned:

atom_check(Atom, Line, Atoms) ->
    case dict:find(Atom, Atoms) of
        {ok, found} ->
            Atoms;
        {ok, _LineDefinedOn} ->
            dict:store(Atom, found, Atoms);
        error ->
            io:format(
                "Line ~B: Atom ~w unexpected.~n",
                [Line, Atom]),
            Atoms
        end.
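
To see the new clause in action, here is a hypothetical module fragment (the atom names and line numbers below are invented for illustration, and are not part of the Atomiser itself):

-atoms([ok, stop]).

handle(Message) ->
    case Message of
        sotp -> ok   % Oops: that was meant to be 'stop'.
    end.

Compiling something like this with the Atomiser enabled should produce a warning along the lines of:

Line 6: Atom sotp unexpected.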


...and we are almost done.


A bit of a shock - but a good sign - is the number of unknown atoms that appear when you first run the Atomiser on itself. All of the atoms we use in our walk_ast pattern tuples show up as unknown atoms, so let's add some proper valid atom lists to the top of the file. First, the atoms we use in our patterns:

-atoms([atom, attribute, call, 'case', clause, clauses, compile]).
-atoms([cons, eof, export, file, 'fun', function, match, module]).
-atoms([nil, remote, string, tuple, var]).


And then some other atoms we use elsewhere in the code:

-atoms([atoms, error, found, ok]).


And... all the function names are showing up as unknown atoms, too?

Executive decision time!

Since internal function-calls are already picked up by the compiler when they do not match a function in the module, I think it should be safe to modify the 'call' line to:

?WALK_AST({call,_Line,_Fun,Args}, [Args]);

This will skip validating the called function name.


And now,

1> c(atomiser), l(atomiser).
{module,atomiser}
2>


Beautiful.

Saturday 7 April 2007

The Atomiser, Part IV

The story so far:

Our intrepid (also whiny, and quite lazy) developer suffers from a serious case of fat fingers and failing eyesight. Mistyped Erlang atoms have caused him countless hours of anguish and hair loss, and the compiler does not even have the decency to emit so much as a warning or apology.

In an effort to fill this massive hole in his development tools, the developer has managed to bolt together the bare bones of a parse transformation module. Soon this unnatural creation will be lumbering around the neighbourhood, causing all sorts of havoc.


The Atomiser currently has just enough intelligence to recognise valid atom lists specified in a given source file, but it is not able to actually do anything with that information. We need to walk the AST and locate the atoms used in the source code.

This walk_ast function will be the work-horse of the Atomiser module:

walk_ast([], Atoms) -> Atoms;

walk_ast([Node|ASTRest], Atoms) ->
    io:format("Unknown node:~n~p~n", [Node]),
    walk_ast(ASTRest, Atoms).


And we will modify parse_transform to call it:

parse_transform(AST, _Options) ->
    Atoms = atoms_from_ast(AST),
    _AtomsMarked = walk_ast(AST, Atoms),
    AST.



With this simple walk_ast function we will get a huge output list of 'unknown' Abstract Syntax Tree nodes when we compile atomiser.erl[1]. The last unknown node will appear as:

Unknown node: {eof,41}

Obviously we will need to add more function clauses to cater for the different nodes in the Abstract Syntax Tree. Rather than starting from the top of the output list, I will work my way up from the bottom. (It saves scrolling up... see reference to 'lazy' above.)

Keeping the empty-list and the unknown-node function clauses as the first and last entries respectively, we add a clause entry for the eof node:

walk_ast([{eof,_Line}|ASTRest], Atoms) ->
    walk_ast(ASTRest, Atoms);


Compiling the atomiser.erl program again[2] shows that, indeed, the {eof,41} node is now missing from the 'unknown node' output.

Continuing on we can enter some more function clauses. Something like this:

walk_ast([], Atoms) -> Atoms;
walk_ast([{call,_Line,Fun,Args}|ASTRest], Atoms) ->
    walk_ast(ASTRest, walk_ast(Args, walk_ast([Fun], Atoms)));
walk_ast([{clause,_Line,Args,Guards,Exprs}|ASTRest], Atoms) ->
    walk_ast(ASTRest, walk_ast(Exprs, walk_ast(Guards, walk_ast(Args, Atoms))));
walk_ast([{cons,_Line,Head,Tail}|ASTRest], Atoms) ->
    walk_ast(ASTRest, walk_ast([Tail], walk_ast([Head], Atoms)));
walk_ast([{eof,_Line}|ASTRest], Atoms) ->
    walk_ast(ASTRest, Atoms);
walk_ast([{function,_Line,_Fun,_Arity,Clauses}|ASTRest], Atoms) ->
    walk_ast(ASTRest, walk_ast(Clauses, Atoms));
walk_ast([{nil,_Line}|ASTRest], Atoms) ->
    walk_ast(ASTRest, Atoms);
walk_ast([{var,_Line,_Name}|ASTRest], Atoms) ->
    walk_ast(ASTRest, Atoms);
walk_ast([Node|ASTRest], Atoms) ->
    io:format("Unknown node: ~p~n", [Node]),
    walk_ast(ASTRest, Atoms).


Hang on, I am typing the same stuff over and over again... this is way too inefficient.

It looks like there is a bit of a pattern going on here. Something along the lines of:

walk_ast([PatternTuple|ASTRest], Atoms) ->
    walk_ast(ASTRest,
        PossiblyNestedCallsToWalkAST(SomeValueInThePattern, Atoms));

Where the possibly-nested calls to walk_ast are based on the elements of the pattern tuple that we are interested in processing recursively.

Now, Erlang does not have a built-in full macro system like Lisp does, but it does at least have basic substitution macros. I think that we can make the code above much nicer to read by using a macro like this:

-define(WALK_AST(Pattern, Expressions),
    walk_ast([Pattern|ASTRest], Atoms) ->
        walk_ast(ASTRest,
            lists:foldl(
                fun(AST, AtomsMarked) ->
                    walk_ast(AST, AtomsMarked)
                    end,
                Atoms,
                Expressions))).


Now that hideous function clause

walk_ast([{clause,_Line,Args,Guards,Exprs}|ASTRest], Atoms) ->
    walk_ast(ASTRest, walk_ast(Exprs, walk_ast(Guards, walk_ast(Args, Atoms))));


may be written like this:

?WALK_AST({clause,_Line,Args,Guards,Exprs}, [Args, Guards, Exprs]);


Continuing to write a function clause for every unknown node in the AST[3], we get:

walk_ast([], Atoms) -> Atoms;
?WALK_AST({atom,_Line,Atom}, []); % Need to check for valid atom.
?WALK_AST({attribute,_Line,file,_File}, []);
?WALK_AST({attribute,_Line,module,_Module}, []);
?WALK_AST({attribute,_Line,export,_ExportList}, []);
?WALK_AST({attribute,_Line,compile,_CompilerDirective}, []);
?WALK_AST({attribute,_Line,atoms,_AtomList}, []);
?WALK_AST({call,_Line,Fun,Args}, [[Fun], Args]);
?WALK_AST({'case',_Line,Test,Clauses}, [[Test], Clauses]);
?WALK_AST({clause,_Line,Args,Guards,Exprs}, [Args, Guards, Exprs]);
?WALK_AST({cons,_Line,Head,Tail}, [[Head], [Tail]]);
?WALK_AST({eof,_Line}, []);
?WALK_AST({'fun',_Line,{clauses,Clauses}}, [Clauses]);
?WALK_AST({function,_Line,_Fun,_Arity,Clauses}, [Clauses]);
?WALK_AST({match,_Line,Left,Right}, [[Left], [Right]]);
?WALK_AST({nil,_Line}, []);
?WALK_AST({remote,_Line,_Module,_Function}, []);
?WALK_AST({string,_Line,_String}, []);
?WALK_AST({tuple,_Line,Elements}, [Elements]);
?WALK_AST({var,_Line,_Name}, []);
walk_ast([Node|ASTRest], Atoms) ->
    io:format("Unknown node: ~p~n", [Node]),
    walk_ast(ASTRest, Atoms).


And this is all we need to walk through the Atomiser's current code, with no unknown node messages appearing.



[1] We need to compile atomiser.erl twice to pick up the changes and see them in action. The first time the compilation is performed the new code is compiled and loaded. The second time the compilation is performed the new code is actually run against the source.

[2] Twice!

[3] Rather than consisting of a bunch of pattern matching clauses, the walk_ast function could be made "smarter" by transforming the given node tuple into a list, and applying some rules-based logic to the elements of that list (from the third element onwards). I chose to do things the function-clause way mainly because I am not completely familiar with all of the elements in an Erlang AST. Having newly-encountered nodes come up as unknown elements is a good way to be exposed to the underlying structure of the parse tree, and to ensure that we have caught everything we need to.
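
For the curious, a rough sketch of that rules-based alternative might look like the following (this is my own speculation rather than anything the Atomiser actually contains; it reuses atom_check and simply skips any element that is not a node or a list of nodes):

%% Speculative generic walker: treat every node as {Tag, Line, Element...}
%% and recurse into each element from the third position onwards.
walk_ast_generic([], Atoms) -> Atoms;
walk_ast_generic([{atom,Line,Atom}|ASTRest], Atoms) ->
    walk_ast_generic(ASTRest, atom_check(Atom, Line, Atoms));
walk_ast_generic([Node|ASTRest], Atoms) when is_tuple(Node) ->
    [_Tag, _Line | Elements] = tuple_to_list(Node),
    walk_ast_generic(ASTRest, lists:foldl(fun walk_element/2, Atoms, Elements));
walk_ast_generic([_Other|ASTRest], Atoms) ->
    walk_ast_generic(ASTRest, Atoms).

%% An element may be a nested node, a list of nodes, or a plain term.
walk_element(Element, Atoms) when is_list(Element) ->
    walk_ast_generic(Element, Atoms);
walk_element(Element, Atoms) when is_tuple(Element) ->
    walk_ast_generic([Element], Atoms);
walk_element(_Element, Atoms) ->
    Atoms.

The obvious cost is that unknown node types pass through silently, which defeats the learning benefit mentioned above.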

Friday 6 April 2007

The Atomiser, Part III

Right, so we have a skeleton parse transformation module that emits a message and returns the unchanged Abstract Syntax Tree. What do we do now?


There are a few requirements listed in Part II; we might as well start from the top:

* Ability to include a list of valid atoms into a source file.

Implicit in this requirement is that the Atomiser will actually be able to read this list and do something with it. Let's work on that.


To begin with we want to embed our list of valid atoms in our source code. Unfortunately we cannot use just any old syntax to do this - the syntax we choose must be parse-able by the existing Erlang compiler so that it can build us an Abstract Syntax Tree to play with.

Luckily we can use a module attribute to specify our list of valid atoms. For no particular reason I will name this attribute 'atoms':

-atoms([atom1, atom2, atom2, atom4]).

The Atomiser will do a quick pass through the top level of the Abstract Syntax Tree to pull in all the 'atoms' module attributes, storing their contents in a dictionary of valid atoms. The atom names themselves will be the keys of this dictionary; the line number of the attribute that specified the atom will be the value stored against each atom key. (We will use these line numbers for reporting later, if a specified atom is unused.)
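
To see what one of these attributes looks like once the compiler has parsed it, here is a throwaway helper (my own scratch function, not part of the Atomiser):

%% Scan and parse a single Erlang form given as a string,
%% returning its abstract syntax tree node.
form_ast(String) ->
    {ok, Tokens, _EndLine} = erl_scan:string(String),
    {ok, Form} = erl_parse:parse_form(Tokens),
    Form.

Calling form_ast("-atoms([atom1, atom2]).") should give back a node along the lines of {attribute,1,atoms,[atom1,atom2]}, which is exactly the shape the functions below will be pattern matching on.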

Here is a function to scan an AST and print all the atoms attributes it finds:

atoms_find([]) ->
    ok;
atoms_find([{attribute,Line,atoms,AtomList}|ASTRest]) ->
    io:format("Found atom list on line ~B: ~p~n", [Line, AtomList]),
    atoms_find(ASTRest);
atoms_find([_Node|ASTRest]) ->
    atoms_find(ASTRest).



And with a slight modification, instead of printing them out we can store the atoms we find in a dictionary:

atoms_from_ast(AST) ->
    atoms_from_ast(AST, dict:new()).

atoms_from_ast([], Atoms) ->
    Atoms;
atoms_from_ast([{attribute,Line,atoms,AtomList}|ASTRest], Atoms) ->
    atoms_from_ast(ASTRest, atoms_from_attribute(Line, AtomList, Atoms));
atoms_from_ast([_|ASTRest], Atoms) ->
    atoms_from_ast(ASTRest, Atoms).



Neat, huh?

Well, okay, I haven't yet added the code to extract the atoms from an atoms attribute and store them in the dictionary. It sounds a bit complicated...

atoms_from_attribute(Line, AtomList, Atoms) ->
    AddAtom = fun(Atom, Dict) ->
        dict:store(Atom, Line, Dict)
        end,
    lists:foldl(AddAtom, Atoms, AtomList).


...well, maybe that is not too complicated after all.


Oh, wait! If I make the Atomiser report on atoms that have already been specified as valid, then I can totally make this look impressive:

atoms_from_attribute(Line, AtomList, Atoms) ->
    AddAtom = fun(Atom, Dict) ->
        case dict:find(Atom, Dict) of
            {ok, LineAlreadyDefined} ->
                io:format(
                    "Line ~B: Atom ~w already defined on line ~B.~n",
                    [Line, Atom, LineAlreadyDefined]),
                Dict;
            error -> dict:store(Atom, Line, Dict)
            end
        end,
    lists:foldl(AddAtom, Atoms, AtomList).


There we go. Now the Atomiser will let us know if we have accidentally specified a valid atom more than once.

We can try this out now. Here is the full listing of the Atomiser so far:


-module(atomiser).
-export([parse_transform/2]).
-compile({parse_transform, atomiser}). % Comment out for initial compile.

-atoms([atom1, atom2, atom2, atom4]).

parse_transform(AST, _Options) ->
    Atoms = atoms_from_ast(AST),
    io:format("Retrieved these valid atoms: ~p~n", [dict:fetch_keys(Atoms)]),
    AST.

atoms_from_ast(AST) ->
    atoms_from_ast(AST, dict:new()).

atoms_from_ast([], Atoms) ->
    Atoms;
atoms_from_ast([{attribute,Line,atoms,AtomList}|ASTRest], Atoms) ->
    atoms_from_ast(ASTRest, atoms_from_attribute(Line, AtomList, Atoms));
atoms_from_ast([_|ASTRest], Atoms) ->
    atoms_from_ast(ASTRest, Atoms).

atoms_from_attribute(Line, AtomList, Atoms) ->
    AddAtom = fun(Atom, Dict) ->
        case dict:find(Atom, Dict) of
            {ok, LineAlreadyDefined} ->
                io:format(
                    "Line ~B: Atom ~w already defined on line ~B.~n",
                    [Line, Atom, LineAlreadyDefined]),
                Dict;
            error -> dict:store(Atom, Line, Dict)
            end
        end,
    lists:foldl(AddAtom, Atoms, AtomList).



Compiling this atomiser.erl file should give us the list of the three unique atoms specified, and a complaint about atom2 occurring twice:

1> c(atomiser), l(atomiser).
Line 5: Atom atom2 already defined on line 5.
Retrieved these valid atoms: [atom1,atom2,atom4]
{module,atomiser}
2>



This thing had better do some real work soon... it is already past thirty lines of code!

Obligatory legal stuff

Unless otherwise noted, all code appearing on this blog is released into the public domain and provided "as-is", without any warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author(s) be liable for any claim, damages, or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.