Not surprising, re. Flickr. Don't be too clever when disk is cheap and only getting cheaper. Remember, we're in NoSQL land, where denormalization is ... the norm.

@siculars on twitter
http://siculars.posterous.com

Sent from my iPhone

On Feb 9, 2011, at 13:12, Jeremiah Peschka <jeremiah.pesc...@gmail.com> wrote:

Incidentally, this is also how Flickr handles writes: when you upload a photo, it gets written to wherever your other photos go. When someone tags it or adds it to a group, it gets copied into that group.

Unless, of course, it's all changed since the last time I looked for information about how Flickr actually works.

--
Jeremiah Peschka
Sent with Sparrow
On Wednesday, February 9, 2011 at 10:09 AM, Alexander Sicular wrote:

The only way this is workable is if you use a uniformly random
hash function, so that you know which key any given address will hash
to. Separately, churn will eat you up if you constantly need to take
addresses out of your keys. Also, as mentioned elsewhere, Riak links
won't work at these numbers.

Check out the tech slides/blogs on how Twitter does this: basically
double/reverse/reciprocal lookup lists with recipient notification.
When @aplusk tweets, Twitter does something like 4 million (or however
many followers he has) writes. I recommend @rk's QCon SF 2010 talk on
NoSQL at Twitter.
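
For illustration only, a rough fan-out-on-write sketch in Python (not
Twitter's actual code; the kv client with get/put, the key scheme, and
the timeline cap are all hypothetical):

    # Each follower has an "inbox" key; a new tweet id gets copied into
    # every follower's inbox, so one post by a user with millions of
    # followers means millions of writes.
    def fan_out(kv, tweet_id, followers):
        for follower in followers:
            inbox_key = "inbox/%s" % follower      # hypothetical key scheme
            timeline = kv.get("timelines", inbox_key) or []
            timeline.insert(0, tweet_id)
            kv.put("timelines", inbox_key, timeline[:800])  # arbitrary bound

The payoff is on the read side: rendering a home timeline is then a
single GET on the reader's inbox key.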

Best, alexander

On 2011-02-09, Scott Lystig Fritchie <fritc...@snookles.com> wrote:
Nathan Sobo <ns...@pivotallabs.com> wrote:

ns> Is a key-value store actually inappropriate for this problem?

No. One way to do it is to use a single KV key to store multiple
addresses' worth of info. Pick a relatively big number, say 50K
subscribers per key, though it may vary. Use a key-naming scheme so that
you can pre-calculate all keys for a given list, e.g. bucket =
list-subscribers, key = list name + range index #, or perhaps list name
+ start-of-hash-range + end-of-hash-range.
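
A minimal sketch of that key-naming idea in Python (the bucket name and
zero-padding are my choices, not part of the original suggestion):

    # One key per "hunk" of subscribers; all keys for a list are predictable.
    BUCKET = "list-subscribers"

    def hunk_key(list_name, range_index):
        # e.g. hunk_key("announce", 17) -> "announce/017"
        return "%s/%03d" % (list_name, range_index)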

How do you know the range index # or the start & end of each range? One
method would be hashing: MD5 or SHA1 or whatever. If you store all
addresses for a list in a fixed number of hash hunks, e.g. 100, then
each hash hunk will have roughly 20K entries for a 2M-subscriber list.
To find all subscribers, fetch the 100 known keys.
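
A sketch of the hashing side, assuming 100 hunks and reusing hunk_key()
from the sketch above (MD5 here is just a cheap, evenly distributed
hash; any of the hashes mentioned would do):

    import hashlib

    NUM_HUNKS = 100  # fixed number of hash hunks

    def hunk_for(addr):
        # Uniformly spread addresses across the hunks.
        digest = hashlib.md5(addr.strip().lower().encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_HUNKS

    def all_keys(list_name):
        # Every key for a list is known up front: fetch these 100 keys to
        # get all subscribers (~20K entries each for a 2M-subscriber list).
        return [hunk_key(list_name, i) for i in range(NUM_HUNKS)]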

If you want to keep addresses in sorted order, it's more work but also
doable. A naive plan is to make your hash function F(addr) = the first
letter of addr. Keys get clumpy that way, but a little more creativity
can get around it.
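
The naive first-letter version of F(addr), for illustration only (it
keeps hunks in rough alphabetical order but will be clumpy; hot letters
could be split into sub-ranges):

    def letter_key(list_name, addr):
        # Partition by first letter of the address; anything non-alphabetic
        # falls into a catch-all hunk.
        first = addr.strip().lower()[0]
        return "%s/%s" % (list_name, first if first.isalpha() else "other")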

To find a particular subscriber, hash that subscriber's address and
fetch 1 key. You're also fetching a lot of uninteresting data in that
key's value, but if that event is uncommon, it's not a problem. (If that
event actually is common, consider moving the commonly-queried data
elsewhere, or duplicate that info in another, smaller key somewhere
else.) Similar logic applies to list maintenance events (add subscriber,
delete subscriber).
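
Putting the pieces together, a sketch against a hypothetical client
exposing get(bucket, key) / put(bucket, key, value), with each hunk
stored as a JSON list of addresses (a real Riak setup would also need a
story for siblings / concurrent updates):

    import hashlib, json

    BUCKET = "list-subscribers"
    NUM_HUNKS = 100

    def _key(list_name, addr):
        digest = hashlib.md5(addr.strip().lower().encode("utf-8")).hexdigest()
        return "%s/%03d" % (list_name, int(digest, 16) % NUM_HUNKS)

    def is_subscribed(kv, list_name, addr):
        # One GET: fetch the hunk this address hashes to, check membership.
        raw = kv.get(BUCKET, _key(list_name, addr))
        return raw is not None and addr in json.loads(raw)

    def add_subscriber(kv, list_name, addr):
        # Read-modify-write of a single hunk; delete_subscriber is the same
        # shape with a discard() instead of an add().
        key = _key(list_name, addr)
        raw = kv.get(BUCKET, key)
        members = set(json.loads(raw)) if raw else set()
        members.add(addr)
        kv.put(BUCKET, key, json.dumps(sorted(members)))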

-Scott

--
Sent from my mobile device

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
