Sure.

To clarify, Riak mapreduce is decent. We store hundreds of millions of objects without trouble, and run mapreduce over hundreds of them for many requests with decent (50-500ms) latencies.
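
As a rough illustration of that small, known-key case, here is a sketch that builds the JSON body for a job POSTed to Riak's HTTP /mapred endpoint. The bucket name "metrics", the helper build_click_count_job, and the photo_id filter are assumptions for illustration, using the record shape from the original question. The code only builds the payload; it does not talk to a cluster.

```python
import json

# Map phase source (JavaScript, executed by Riak): emit 1 for each stored
# click on the wanted photo, 0 otherwise. `arg` is the per-phase argument.
MAP_SOURCE = """
function(value, keyData, arg) {
  var doc = JSON.parse(value.values[0].data);
  return [(doc.type === "Photo::Click" && doc.photo_id === arg) ? 1 : 0];
}
""".strip()


def build_click_count_job(bucket, keys, photo_id):
    """Build a /mapred job body counting clicks on one photo over known keys.

    Hypothetical helper: the name and arguments are illustrative only.
    """
    return {
        # Explicit [bucket, key] pairs, so Riak never has to list keys.
        "inputs": [[bucket, key] for key in keys],
        "query": [
            {"map": {"language": "javascript",
                     "source": MAP_SOURCE,
                     "arg": photo_id}},
            # Built-in reduce function that sums the emitted numbers.
            {"reduce": {"language": "javascript",
                        "name": "Riak.reduceSum"}},
        ],
    }


job = build_click_count_job("metrics", ["k1", "k2", "k3"], 255)
body = json.dumps(job)  # this string would be the POST body
```

Because the inputs are an explicit list of bucket/key pairs, a job like this stays in the "hundreds of keys" regime and avoids key listing altogether.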

It's just not the best for jobs over millions of keys; such a job will take much longer than a comparable one implemented in, say, Hadoop. It's also difficult to debug MR in Riak, but it's difficult to debug Hadoop as well. If either *could* work, the answer probably comes down to "do you have the man-hours and expertise necessary to keep Hadoop happy".

Riak can also collapse in horrible ways when asked to list huge numbers of keys. Some people say it just gets slow on their large installations. We've actually seen it hang the cluster altogether. Try it and find out! Basho understands this and is aiming to address it, but I've heard no specific timetable or plans. Meanwhile we pull keys out of the underlying storage directly, and cache them in Redis. That may be a viable solution for you.

Mecha is something experimental that John Muellerleile is working on.

http://www.slideshare.net/jmuellerleile/scaling-with-riak-at-showyou

Basically, it's a new backend for Riak (if you weren't aware, Riak has pluggable storage backends). You still read and write to Riak as normal, but under the hood, it stores the data in LevelDB (one per partition per vnode), and *also* indexes specially named fields in a local Solr core on each node. Using the coverage code in Riak 1.0, we can then issue a Solr query to some subset of nodes and receive a response that covers all the values stored in Riak. You can filter, count, facet, etc. by text, numbers, multivalued text, geolocation, and so on. I would describe it as "scary fast".

The downside is that it's also experimental, and glues together a lot of different technologies. All those moving parts mean we haven't had time to package it up and open-source it yet, but sometime in December or January we're hoping to focus on polish and release.

--Kyle

On 11/28/2011 02:59 PM, Michael Dungan wrote:
Thank you for getting back to me. It does look like we'll be needing to
go big, as we're already at 5m new records/month, so just dealing with
monthly numbers is already beyond the few hundred thousand keys you
mentioned, unless I'm thinking about this wrong.

I would love to hear more about Mecha if you're willing to share. Feel
free to contact me off-list.

thanks again,

-mike


On 11/28/11 2:24 PM, Aphyr wrote:
For limited mapreduce (where you know the keys in advance) riak would be
a fine choice. 500 million keys, n val 3 is readily achievable on
commodity hardware; say four nodes with 128GB SSDs.
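
A quick back-of-the-envelope check of those numbers, as a sketch. It assumes one 128GB SSD per node, 128GB = 128 * 10^9 bytes, and an even spread of replicas across nodes; it ignores backend and filesystem overhead, so the real budget per object is smaller.

```python
def bytes_per_replica(keys, n_val, nodes, disk_bytes_per_node):
    """Average on-disk budget per stored replica, ignoring all overhead."""
    replicas_per_node = keys * n_val / nodes
    return disk_bytes_per_node / replicas_per_node


# 500 million keys, 3 copies each, spread over 4 nodes.
replicas_per_node = 500_000_000 * 3 / 4   # 375 million replicas per node
budget = bytes_per_replica(500_000_000, 3, 4, 128 * 10**9)
print(round(budget))  # 341
```

Under these assumptions each replica gets roughly 341 bytes of disk, so the estimate only holds for small records.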

If large-scale mapreduce (more than a few hundred thousand keys) is
important, or listing keys is critical, you might consider HBase.

If you start hitting latency/write bottlenecks, it may be worth
accumulating metrics in Redis before flushing them to disk.
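
A minimal sketch of that accumulate-then-flush pattern. An in-memory Counter stands in for Redis here (in a real setup each record() call would be something like one HINCRBY), and flush_fn is a hypothetical callback that writes a batch to durable storage. The class name and the max_pending threshold are assumptions for illustration.

```python
from collections import Counter


class MetricBuffer:
    """Accumulate event counts in memory and flush them in batches.

    The Counter is a stand-in for Redis; `flush_fn` is a hypothetical
    callback that persists one batch (for example, a write to Riak).
    """

    def __init__(self, flush_fn, max_pending=1000):
        self.flush_fn = flush_fn
        self.max_pending = max_pending
        self.counts = Counter()
        self.pending = 0

    def record(self, event_type, photo_id):
        """Count one event; flush once enough events are pending."""
        self.counts[(event_type, photo_id)] += 1
        self.pending += 1
        if self.pending >= self.max_pending:
            self.flush()

    def flush(self):
        """Hand the aggregated counts to flush_fn and reset the buffer."""
        if not self.counts:
            return
        batch = dict(self.counts)
        self.counts.clear()
        self.pending = 0
        self.flush_fn(batch)


flushed = []
buf = MetricBuffer(flushed.append, max_pending=3)
for photo in (255, 255, 7):
    buf.record("Photo::Click", photo)
```

Three events arrive but only one batch is written, which is the point: the slow store sees aggregated writes rather than one write per click.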

At Showyou, we're also building a custom backend called Mecha which
integrates Riak and SOLR, specifically for this kind of analytics over
billions of keys. We haven't packaged it for open-source release yet,
but it might be worth talking about off-list.

--Kyle

On 11/28/2011 02:07 PM, Michael Dungan wrote:
Hi,

Sorry if this has been asked before - I couldn't find a searchable
archive of this list.

I was told to ask this list whether or not Riak would be appropriate for
tracking our site's metrics. We are currently using Redis for this but
are at the point where we need both clustering and m/r capability, and
on the surface, Riak looks to fit this bill (we already use Erlang
elsewhere in our app, so that's an additional plus).

The records are pretty small and can be represented easily in JSON. An
example:

{
"id": "c4473dc5cfc5da53831d47c4c016d1c7de0a31e4fd94229e47ade569ef011a7b",
"type": "Photo::Click",
"user_id": 2640,
"photo_id": 255,
"ip": "100.101.102.103",
"created_at": "2011/04/08 17:09:40 -0700"
}

We currently have around 25 million records similar to this one, and are
adding 4-5 million more each month.

Is Riak appropriate for this use case? Are there any gotchas I need to
be aware of?

thank you,

-mike


_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
