Not surprising, re. Flickr. Don't be too clever when disk is cheap and only getting cheaper. Remember, we're in NoSQL land, where denormalization is ... the norm.

@siculars on twitter
http://siculars.posterous.com

Sent from my iPhone

On Feb 9, 2011, at 13:12, Jeremiah Peschka <jeremiah.pesc...@gmail.com> wrote:

Incidentally, this is also how Flickr handles writes: when you upload a photo, it gets written to wherever your other photos go. When someone tags it or adds it to a group, it gets copied into that group.

Unless, of course, it's all changed since the last time I looked for information about how Flickr actually works.

--
Jeremiah Peschka
Sent with Sparrow
On Wednesday, February 9, 2011 at 10:09 AM, Alexander Sicular wrote:

The only way this is workable is if you use a uniformly random
hash function, so that you know which key any given address will hash
to. Separately, churn will eat you up if you constantly need to take
addresses out of your keys. Also, as mentioned elsewhere, Riak links
won't work at these numbers.

Check out the tech slides/blogs on how Twitter does this: basically
double/reverse/reciprocal lookup lists with recipient notification.
When @aplusk tweets, Twitter does something like 4 million (or however
many followers he has) writes. I recommend @rk's QCon SF 2010 talk on
NoSQL at Twitter.
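
For illustration only, a rough fan-out-on-write sketch in Python (not
Twitter's actual code; the kv client with get/put, the key scheme, and
the timeline cap are all hypothetical):

    # Each follower has an "inbox" key; a new tweet id gets copied into
    # every follower's inbox, so one post by a user with millions of
    # followers means millions of writes.
    def fan_out(kv, tweet_id, followers):
        for follower in followers:
            inbox_key = "inbox/%s" % follower      # hypothetical key scheme
            timeline = kv.get("timelines", inbox_key) or []
            timeline.insert(0, tweet_id)
            kv.put("timelines", inbox_key, timeline[:800])  # arbitrary bound

The payoff is on the read side: rendering a home timeline is then a
single GET on the reader's inbox key.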

Best, alexander

On 2011-02-09, Scott Lystig Fritchie <fritc...@snookles.com> wrote:
Nathan Sobo <ns...@pivotallabs.com> wrote:

ns> Is a key-value store actually inappropriate for this problem?

No. One way to do it is to use a single KV key to store multiple
addresses' worth of info. Pick a relatively big number, say 50K
subscribers per key, though it may vary. Use a key-naming scheme so that
you can pre-calculate all keys for a given list, e.g. bucket =
list-subscribers, key = list name + range index #, or perhaps list name
+ start-of-hash-range + end-of-hash-range.
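
A minimal sketch of that key-naming idea in Python (the bucket name and
zero-padding are my choices, not part of the original suggestion):

    # One key per "hunk" of subscribers; all keys for a list are predictable.
    BUCKET = "list-subscribers"

    def hunk_key(list_name, range_index):
        # e.g. hunk_key("announce", 17) -> "announce/017"
        return "%s/%03d" % (list_name, range_index)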

How do you know the range index # or the start & end of each range? One
method would be hashing: MD5 or SHA1 or whatever. If you store all
addresses for a list in a fixed number of hash hunks, e.g. 100, then
each hash hunk will have roughly 20K entries for a 2M-subscriber list.
To find all subscribers, fetch the 100 known keys.
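
A sketch of the hashing side, assuming 100 hunks and reusing hunk_key()
from the sketch above (MD5 here is just a cheap, evenly distributed
hash; any of the hashes mentioned would do):

    import hashlib

    NUM_HUNKS = 100  # fixed number of hash hunks

    def hunk_for(addr):
        # Uniformly spread addresses across the hunks.
        digest = hashlib.md5(addr.strip().lower().encode("utf-8")).hexdigest()
        return int(digest, 16) % NUM_HUNKS

    def all_keys(list_name):
        # Every key for a list is known up front: fetch these 100 keys to
        # get all subscribers (~20K entries each for a 2M-subscriber list).
        return [hunk_key(list_name, i) for i in range(NUM_HUNKS)]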

If you want to keep addresses in sorted order, it's more work but also
doable. A naive plan is to make your hash function F(addr) = the first
letter of addr. Keys get clumpy that way, but a little more creativity
can get around it.
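
The naive first-letter version of F(addr), for illustration only (it
keeps hunks in rough alphabetical order but will be clumpy; hot letters
could be split into sub-ranges):

    def letter_key(list_name, addr):
        # Partition by first letter of the address; anything non-alphabetic
        # falls into a catch-all hunk.
        first = addr.strip().lower()[0]
        return "%s/%s" % (list_name, first if first.isalpha() else "other")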

To find a particular subscriber, hash that subscriber's address and
fetch 1 key. You're also fetching a lot of uninteresting data in that
key's value, but if that event is uncommon, it's not a problem. (If that
event actually is common, consider moving the commonly-queried data
elsewhere, or duplicate that info in another, smaller key somewhere
else.) Similar logic applies to list maintenance events (add subscriber,
delete subscriber).
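
Putting the pieces together, a sketch against a hypothetical client
exposing get(bucket, key) / put(bucket, key, value), with each hunk
stored as a JSON list of addresses (a real Riak setup would also need a
story for siblings / concurrent updates):

    import hashlib, json

    BUCKET = "list-subscribers"
    NUM_HUNKS = 100

    def _key(list_name, addr):
        digest = hashlib.md5(addr.strip().lower().encode("utf-8")).hexdigest()
        return "%s/%03d" % (list_name, int(digest, 16) % NUM_HUNKS)

    def is_subscribed(kv, list_name, addr):
        # One GET: fetch the hunk this address hashes to, check membership.
        raw = kv.get(BUCKET, _key(list_name, addr))
        return raw is not None and addr in json.loads(raw)

    def add_subscriber(kv, list_name, addr):
        # Read-modify-write of a single hunk; delete_subscriber is the same
        # shape with a discard() instead of an add().
        key = _key(list_name, addr)
        raw = kv.get(BUCKET, key)
        members = set(json.loads(raw)) if raw else set()
        members.add(addr)
        kv.put(BUCKET, key, json.dumps(sorted(members)))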

-Scott

--
Sent from my mobile device

_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
