Hello! I’m Robert Kaye from the MetaBrainz Foundation. We’re the people behind MusicBrainz (https://musicbrainz.org) and, more recently, ListenBrainz (https://listenbrainz.org). ListenBrainz aims to re-create what last.fm used to be. We’ve already collected 200M listens (AKA scrobbles) from our users, which is not a lot, really. We’ve set up an Apache Spark cluster and are starting to build user listening statistics with it.
While our setup is working, we can see that our current approach is not going to scale well. We’ve been reading the docs and asking for help on the IRC channel, but we continue to miss important bits about how we should be doing things. Best practices around Spark seem to be hard to come by. :(

MetaBrainz is all open source and open data: any of the data we use is available for anyone to download. We’re a non-profit working hard towards creating open source music recommendation engines. We’re hoping that someone could take us under their wing, turn up in our IRC channel, and help us find the right path towards using Spark much more effectively than we have so far.

Is anyone on this list interested in helping out? Perhaps you know someone who might?

Thanks!

--

--ruaok

Robert Kaye -- r...@metabrainz.org -- http://metabrainz.org