This is me

Tuesday, 19 June 2007

Collaborative Filtering: Weighted Slope One in Erlang

I have been toying with the Netflix Challenge for a while now. It's fascinating stuff.

I knew nothing about collaborative filtering when I started this project, but that it pretty normal for my coding hobbies. (Hey, why would I start something new if I already knew how to do it?)

During my standard "collect and absorb everything" phase I ran across an article on Wikipedia that described the Slope One algorithm. This article had links to various implementations of the algorithm, including a standalone Java program by Daniel Lemire. More information on the Weighted Slope One algorithm may be found on Daniel's site.

I am not using Slope One in my current Netflix algorithm attempt, but I translated Daniel's Java code into Erlang as a learning exercise anyway:

``%% slopeone.erl% Philip Robinson% A simple implementation of the weighted slope one% algorithm in Erlang for item-based collaborative% filtering.% Based on the same algorithm in Java by Daniel Lemire and Marco Ponzi:% http://www.daniel-lemire.com/fr/documents/publications/SlopeOne.java-module(slopeone).-export([start/0]).start() ->    % The rating database: A list of users, each containing a list of {item, rating} elements.    Items = [{item1,"candy"}, {item2,"dog"}, {item3,"cat"}, {item4,"war"}, {item5,"strange food"}],    DataRating = [{user1, "Bob",       [{item1,1.0}, {item2,0.5}, {item4,0.1}]},                  {user2, "Jane",      [{item1,1.0}, {item3,0.5}, {item4,0.2}]},                  {user3, "Jo",        [{item1,0.9}, {item2,0.4}, {item3,0.5}, {item4,0.1}]},                  {user4, "StrangeJo", [{item1,0.1}, {item4,1.0}, {item5,0.4}]}],    % The difference & frequency database: an ETS table of {{item X, itemY}, diff, freq}.    ets:new(diff_freq, [private, set, named_table]),    % Create the predictor engine.    build_matrix(DataRating),    io:format("Here's the data I have accumulated...~n"),    print_data(Items, DataRating),    io:format("Ok, now we predict...~n"),    TestRatings1 = [{item5,0.4}],    io:format("Inputting...~n"),    print_user_ratings(Items, TestRatings1),    io:format("Getting...~n"),    print_user_ratings(Items, predict(Items, TestRatings1)),    TestRatings2 = [{item4,0.2}|TestRatings1],    io:format("Inputting...~n"),    print_user_ratings(Items, TestRatings2),    io:format("Getting...~n"),    print_user_ratings(Items, predict(Items, TestRatings2)),    ets:delete(diff_freq).% Based on existing data, and using weights, try to predict all missing ratings.% The trick to make this more scalable is to consider only diff_freq entries% having a large (> 1) frequency entry.predict(Items, ItemRatings) ->    PredFreq = ets:new(pred_freq, []),    [ets:insert(PredFreq, {Item, 0.0, 0}) || {Item, _} <- Items],    [[case {ets:match(diff_freq, {{ItemY, ItemX}, '\$1', '\$2'}),            ets:match(PredFreq, {ItemY, '\$1', '\$2'})} of        {[], _} -> ok;        {[[Diff, DFreq]], [[Pred, PFreq]]} ->            ets:insert(PredFreq, {ItemY, Pred + ((RatingX + Diff) * DFreq), PFreq + DFreq})        end || {ItemY, _} <- Items] || {ItemX, RatingX} <- ItemRatings],    ets:match_delete(PredFreq, {'_', '_', 0}), % Remove all zero-frequency predictions.    [ets:insert(PredFreq, {Item, Rating, 1}) || {Item, Rating} <- ItemRatings], % Re-insert actual ratings.    Results = [{Item, Rating / Freq} || {Item, Rating, Freq} <- ets:tab2list(PredFreq)],    ets:delete(PredFreq),    Results.print_data(Items, DataRating) ->    [begin io:format("~s~n", [Name]),        print_user_ratings(Items, ItemRatings) end        || {_Id, Name, ItemRatings} <- DataRating],    [print_item_diffs(Items, Item) || Item <- Items],    io:format("~n").print_user_ratings(Items, ItemRatings) ->    [begin {value, {Item, NameItem}} = lists:keysearch(Item, 1, Items),        io:format(" ~12s --> ~4.2f~n", [NameItem, Rating]) end        || {Item, Rating} <- lists:sort(ItemRatings)].print_item_diffs(Items, {Item, Name}) ->    io:format("~n~12s:", [Name]),    [case ets:match(diff_freq, {{Item, Id}, '\$1', '\$2'}) of        [] -> io:format("          ");        [[Diff, Freq]] -> io:format("  ~6.3f ~1B", [Diff, Freq])        end || {Id, _} <- Items].% Build a matrix (ETS table) of {{itemX, itemY}, Difference, Frequency} entries.build_matrix(DataRating) ->    % Gather the sum difference and the total count (frequency).    [[[case ets:lookup(diff_freq, {ItemX, ItemY}) of        [] -> ets:insert(diff_freq, {{ItemX, ItemY}, RatingX - RatingY, 1});        [{Key, Diff, Freq}] -> ets:insert(diff_freq, {Key, Diff + RatingX - RatingY, Freq + 1})        end || {ItemY, RatingY} <- ItemRatings]            || {ItemX, RatingX} <- ItemRatings]            || {_, _, ItemRatings} <- DataRating],    % Divide sum of difference by frequency to get mean difference.    DivideFun = fun({Key, Diff, Freq}, _) -> ets:insert(diff_freq, {Key, Diff / Freq, Freq}) end,    ets:foldl(DivideFun, undefined, diff_freq).``

Some musings:

* I do not really consider this code to be a typical "Erlang-style" program. In particular, I am making substantial use of side-effects via ETS tables; comparisons between this code and the Java original will probably not be representative of comparisons between Erlang and Java programs in general.

* I may have gone a little overboard with the use of list comprehensions. I did not really like LCs when first exposed to them and it seems that I am overcompensating for that now.

* While some functions are obviously shorter in this Erlang version (compare `build_matrix` and `buildDiffMatrix`, for example), I am not convinced that they are necessarily clearer in Erlang than in Java. At least one advantage of smaller functions is that I can fit more of them on my 8.9" screen, but if that was my main concern then I would be programming in something even less verbose.

* While I haven't delved into this much, I suspect that using ETS has made this program harder to parallelise effectively. While the loops could easily be converted to use a parallel map algorithm, all CRUD activity still has to go through a single ETS process bottleneck. One possible way of utilising multiple CPUs might be to spawn a bunch of processes to manage individual items, send each of these messages regarding specific ratings, and then convert the results into a dictionary via a list comprehension of `receive` statements and `dict:from_list/1`.

* I did run the Dialyzer over the code but I have not even considered optimising it for performance. It will be slower than the Java version. :-)

1. I ran into a similar, dissatisfying use of ETSes when I translated an implementation of Bayes' theorem from another language into Erlang. I'll be interested to hear if you can find a more parallelizable, side-effect free way to implement these sorts of algorithms.

2. I also wrote an implementation of Slope-One in Erlang, using Erlang dictionaries. I think it make the code more readable (more Java like), but was thinking to rewrite it using ETS to improve efficiency (or even Mnesia to distribute it easily).

I think using Erlang for collaborative filtering make sense because it enable to distribute a system for real time data processing (online learning). Is it your plan to use Erlang for the other CF project?

3. Tony: I am currently writing up a blog post for a second version of this program. This new version uses an improbable number of spawns and dictionaries instead of ETS tables.

Khigia: My understanding is that ETS tables would be faster than dictionaries if you only have a single CPU, but that the bottleneck of passing everything through one ETS process would probably slow down a concurrent algorithm. I haven't actually measured this, though, so I could be completely wrong. (If so, I would be more than happy for someone to correct me).

I am indeed also using Erlang for my Netflix CF project.

In fact my current implementation uses a simple dict-of-dict kind of structure (matrix is a dict indexed by row, one row being a dict indexed by column) ... so this is not easy to distribute.
My idea of using ETS (or mnesia) is to distribute the work per row or column between processes, each process accessing different part of the table. Do you think it would be a bottleneck to access this common table even if processes access different part of it?
I'm looking forward for your next post with lot of spawn! ;)

5. Khigia: You have probably seen my next post by now, where I have parallelised the creation of the diff/freq dictionary. You could use a similar technique to parallelise the creation of a dict-of-dicts by spawning one process per row-dict, then inserting finished row-dicts into a parent dictionary at the end.

I would be very careful with using plain ETS in a concurrent environment. The ETS module documentation even states that "this module provides very limited support for concurrent updates".

I do think that a shared ETS table will be a bottleneck for multiple processes, but it really does depend on the program and hardware as to whether this is even worth thinking about. For instance, my Netflix code spends far more time in calculation code than it does accessing and updating data in Mnesia tables. (If I had access to, say, 80 CPUs then it might be a different story...)

6. good!