Friday, 13 April 2007

Erlang Macro Processor (v1), Part IV

We know what we want, we know how we want to use it, and so without further ado, here it is: the code for EMP1.

-author("Philip Robinson").

parse_transform(ASTIn, _Options) -> walk_ast(ASTIn, []).

walk_ast([], ASTOut) -> lists:reverse(ASTOut);
walk_ast([{attribute,Line,macro,{Mod,Fun,Args}}|RestASTIn], ASTOut) ->
    ReversedResults =
            lists:flatten([apply(Mod,Fun,Args)|[" "]]),
            Line, []),
    walk_ast(RestASTIn, ReversedResults ++ ASTOut);
walk_ast([Node|ASTInRest], ASTOut) -> walk_ast(ASTInRest, [Node|ASTOut]).

ast_reversed_results(ResultsString, LineStart, ASTResults) ->
    case string_trim_whitespace(ResultsString) of
        "" -> ASTResults;
        String ->
            {done,{ok,Tokens,LineEnd},StringRest} =
                erl_scan:tokens([], String, LineStart),
            {ok, AST} = erl_parse:parse_form(Tokens),
            ast_reversed_results(StringRest, LineEnd, [AST|ASTResults])

string_trim_whitespace([ 9|String]) -> string_trim_whitespace(String);
string_trim_whitespace([10|String]) -> string_trim_whitespace(String);
string_trim_whitespace([32|String]) -> string_trim_whitespace(String);
string_trim_whitespace( String ) -> String.

And that is it - 30 lines of code.

No, really. That is all there is. I can take you through it in some detail, if you want. Fasten your seat-belts.

EMP1 In Detail

I will start with the "mostest ugliest" piece of Erlang code I have ever written: the string_trim_whitespace function.

This function returns the given string minus any leading tabs, carriage returns, or spaces. I searched the Erlang documentation and the Trap Exit website but I did not manage to find any built-in functions that achieved the same goal. Four lines of code seems a bit excessive for what it actually does and I am sure there must be a nicer way of writing it.

This function is actually a reasonably good example of Erlang pattern-matching and tail-recursion at work. If the given string begins with a tab (ASCII 9), carriage return (ASCII 10), or a space (ASCII 32), then it will match one of the first three function clauses. The first character will be dropped and the function recursively called with the rest of the string.

If the string does not match any of those three function clauses then it must not have a tab, carriage return, or space at the beginning, so the given string is returned as-is. This even works for the empty string. (Technically it would also match any non-string argument - integer, float, tuple, or whatever - and just return the input given.)

Even though the function uses recursion there is no danger of the stack blowing out no matter how large the string is. Erlang (like most functional languages) has a neat trick of turning tail recursion calls into a goto loop so the function executes in constant memory space. Others have explained tail-recursion better than I can, so let's move on...

Next on the list is the walk_ast function, which runs through the top level of the inbound AST and builds an outbound AST. The outbound AST list is built in reverse order to take advantage of the cheap list 'cons' operation: it is very quick to add or remove an element at the beginning of a list but much more expensive to add or remove an element at the end of a list. When the whole inbound AST has been processed (the argument matches the empty list) then the outbound AST is run through the lists:reverse function to switch it back to the right-way-around order again. If you are not yet familiar with this build-in-reverse-then-switch idiom, you soon will be. :-)

There are only three function clauses in this walk_ast function: The final case where we reverse and return the new AST, processing a 'macro' module attribute, and everything else.

The 'final' case I have covered above, and the 'everything else' case just passes the node straight to the outbound AST. The magic of EMP1 happens in the macro module attribute clause.

The walk_ast function looks for macro atributes of this form:

-macro({Module, Function, Args}).

When it finds a match it calls the module/function with the given args and captures the result, which should be a string representation of an Erlang function. It adds this string to the beginning of a list containing a single space and then flattens the total result.

A space is added to the end of the return string because erl_scan:tokens has a problem parsing something like "42." - it cannot tell if this is the beginning of a floating-point number. To avoid this I add a space to the end of the string; erl_scan:tokens knows that "42. " is just the integer 42.

The resulting string is also flattened because io_lib:format does some funny things when you use "~s" to embed a value string into a format string. For example, io_lib:format("ab~se", ["cd"]) produces [97,98,"cd",101] instead of an expected (in my opinion) "abcde". This might be okay for printing, which I presume flattens its input as it goes, but this is a terrible format for erl_scan to tokenise.

Once the macro's return string has been mutilated enough it is passed on to ast_reversed_results, for some further mangling.

The ast_reversed_results function does pretty much all the heavy lifting for the module. It takes in the current result string (a flattened text representation of one or more Erlang functions with a space at the end), the line the module attribute was declared, and the current AST list of processed results (in reversed order as per the functional programming idiom mentioned above).

The very first thing this function does is to strip leading whitespace characters from the input string, and test that result against the empty string.

For some reason erl_scan:tokens returns a {more, SomeWeirdStuff} tuple when it is handed a string of whitespace characters (and also when given the empty string). I have no idea what I should do with this result so I strip the leading whitespace characters out and test for the empty string instead.

If the stripped string is not empty then we want to tokenise and parse the first form (which should be a function definition), add the parsed results to the beginning of our AST list, and try again with the rest of the string (as it is possible to include more than one function definition in the macro return string).

If the stripped string is empty then there is nothing left to process and we can return the (reversed) AST of result. We keep these in reversed order because it is just pre-pended to the walk_ast's ASTOut, and it will all be re-reversed at the end.


EMP1 Epilogue and Notes

* An interesting 'feature' of EMP1 is that it may be used to create functions where the function name is programmatically generated. I am not sure why you might choose to create a whole bunch of separate, named functions over, say, creating one function with multiple clauses triggered by an atom argument, but EMP1 certainly makes it possible.

* I would recommend avoiding carriage returns in macro output strings. It does not actually break anything, but it tends to obfuscate the stack trace output of any runtime exceptions thrown from the generated code.

* One advantage of compile-time macros over run-time function-building techniques is that the usual compiler checks are run over the generated code. (The macro-created code is actually there at compile-time rather than appearing later at run-time.) I like to get my bug reports early, and if the compiler can complain then I don't need to wait for unit tests to raise an issue.

Using compile-time macros also means that static code analysis tools such as the Dialyzer will include the generated functions in its analysis and report.

There are, however, situations where not all of the information needed to create a function is available at compile-time. If you find yourself in such a predicament you might want to check out Yariv's smerl project, which makes it a lot easier to do runtime meta-programming.

I might need to use smerl when I write EMP2.


  1. philip,
    very nice article.
    just thought i'd contribute a different/better string_trim_whitespace/1.

    string_trim_whitespace(Str) ->
    string_trim(Str,Bad) ->
    lists:dropwhile(fun(X)->lists:member(X,Bad) end,Str).

  2. Thank you for that suggestion, Mats. I had a suspicion that my version was too awkward to be "correct Erlang". I can see that I will have to make myself more familiar with the other higher-order functions in the lists module.


Obligatory legal stuff

Unless otherwise noted, all code appearing on this blog is released into the public domain and provided "as-is", without any warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author(s) be liable for any claim, damages, or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.