Sunday 8 July 2007

Erlang and the Very Large Binary


Update 28/07/2007: The issue with pattern-matching on very large binaries has been resolved in R11B5, so the workaround below is no longer needed. Nothing to see here folks, move along... many thanks to the Erlang OTP team for clearing this up.

The following post has been kept purely for posterity.

(Also, see Per's comment where a better workaround than mine is suggested.)

----------


I have these binary files I created from the Netflix data. Some of them are quite large, so for peace of mind I had to do a quick check to see whether Erlang could handle binaries of that size.

It turns out that Erlang can indeed handle some reasonably large binary sizes. Sort of. There was certainly no problem with loading a 300MB binary into RAM. Accessing the elements of this binary, however, proved to be somewhat of a problem.

I had written a simple helper function to manage retrieving elements from memory- or file-based binaries[1]:

bin_get(BytesOffset, BytesToRead, Fd={file_descriptor,prim_file,_}) ->
{ok,Element} = file:pread(Fd, BytesOffset, BytesToRead),
Element;
bin_get(BytesOffset, BytesToRead, Bin) ->
<<_:bytesoffset/binary, Element:BytesToRead/binary>> = Bin,
Element.


For relatively small BytesOffset values everything worked as expected. But as soon as I tried an offset of 134,217,728 bytes or higher I received a badmatch error from bin_get/3... but only for memory-based binaries. Opening a file descriptor to the same binary and retrieving the same offset value worked just fine, if a bit slower.

There appears to be a maximum element size limit of 2^27 - 1 for binary pattern matching.[2]


There was a simple workaround for this limit - all I needed to do was have a few extra clauses in bin_get/3 and insert multiple anonymous elements into the binary pattern match where needed. Since my largest binary is just a bit over 300MB I could get away with three clauses:

bin_get(BytesOffset, BytesToRead, Bin) when BytesOffset =< 134217727 ->
<<_:BytesOfset/binary, Element:BytesToRead/binary>> = Bin,
Element;
bin_get(BytesOffset, BytesToRead, Bin) when BytesOffset =< 268435454 ->
BytesOffset2 = BytesOffset - 134217727,
<<_:134217727/binary, _:BytesOffset2/binary, Element:BytesToRead/binary>> = Bin,
Element;
bin_get(BytesOffset, BytesToRead, Bin) ->
BytesOffset2 = BytesOffset - 268435454,
<<_:134217727/binary, _:134217727/binary, _:BytesOffset2/binary, Element:BytesToRead/binary>> = Bin,
Element.


I was happy to let a call for an offset greater than 402,653,181 to produce a runtime badmatch error, but not so happy to discover that the code above produced a compile error:

beam/beam_load.c(1551): Error loading function test:bin_get/3: op bs_skip_bits2 f x w u u: no specific operation found
{error,badfile}


After judicious use of the 'comment out lines of code and recompile the program' debugging technique, I determined that the Erlang compiler really did not like having the two initial anonymous elements in that last binary pattern match. Even turning the underscores into named (but ignored) variables gave the same result.


The solution to the problem raised by my solution to the initial problem was to go recursive:

-define(MAX_BIN_ELEM_SIZE, 134217727).
bin_get(BytesOffset, BytesToRead, Bin) when BytesOffset =< ?MAX_BIN_ELEM_SIZE ->
<<_:BytesOffset/binary, Element:BytesToRead/binary>> = Bin,
Element;
bin_get(BytesOffset, BytesToRead, <<_:?MAX_BIN_ELEM_SIZE/binary,Bin/binary>>) ->
bin_get(BytesOffset - ?MAX_BIN_ELEM_SIZE, BytesToRead, Bin).


In other words, keep discarding the first 'magic number' bytes from the binary and subtract the magic number from the offset until our offset is equal to or less than the magic number, then access the element in the binary in the usual manner.


I am not particularly happy with the need for this hack, but the end result lets me use some large in-memory binaries instead of constantly seeking and reading from disk. (If this last attempt hadn't worked then I was going to try breaking the large binaries into 100-MB chunks, and that would have been a much uglier workaround.)



[1] Depending on how much RAM the machine I was running the code on had, I set certain read-only binary files to be either loaded straight into memory or to just have a file handle opened for them. bin_get/3 was written to abstract that difference away from the rest of the code. I would have gotten away with it, too, if it weren't for those pesky errors.

[2] This would be consistent with using a 32-bit word to store the size value (with 4 of those bits used for a type identifier and 1 bit for a sign indicator). I would expect the element limit on a 64-bit architecture to be somewhat larger, and I wouldn't have noticed this problem with the size of the binaries I am currently using.

Obligatory legal stuff

Unless otherwise noted, all code appearing on this blog is released into the public domain and provided "as-is", without any warranty of any kind, express or implied, including but not limited to the warranties of merchantability, fitness for a particular purpose and noninfringement. In no event shall the author(s) be liable for any claim, damages, or other liability, whether in an action of contract, tort or otherwise, arising from, out of or in connection with the software or the use or other dealings in the software.