Not surprising, re: Flickr. Don't be too clever when disk is cheap and
only getting cheaper. Remember, we're in NoSQL land, where
denormalization is ... the norm.
@siculars on twitter
http://siculars.posterous.com
Sent from my iPhone
On Feb 9, 2011, at 13:12, Jeremiah Peschka
<jeremiah.pesc...@gmail.com> wrote:
Incidentally, this is also how flickr handles writes - when you
upload a photo it gets written to wherever your other photos go.
When someone tags it or adds it to a group, it gets copied into that
group.
Unless, of course, it's all changed since the last time I looked for
information about how flickr actually works.
--
Jeremiah Peschka
Sent with Sparrow
On Wednesday, February 9, 2011 at 10:09 AM, Alexander Sicular wrote:
The only way this is functional is if you implement a uniformly
random hash function, so that you know which key any given address
will hash to. Separately, churn will eat you up if you constantly
need to take addresses out of your keys. Also, as mentioned
elsewhere, Riak links won't work at these numbers.
Check out more tech slides/blogs on how Twitter does this. Basically,
double/reverse/reciprocal lookup lists with recipient notification.
When @aplusk tweets, Twitter does something like 4 million writes (or
however many followers he has). I recommend @rk's QCon SF 2010 talk
on NoSQL at Twitter.
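The fan-out-on-write idea above can be sketched in a few lines. This is a hypothetical illustration, not Twitter's actual code; the follower lists and in-memory dicts stand in for whatever key/value store holds the real data:

```python
# Fan-out on write: when an author posts, copy the post into every
# follower's timeline. One tweet from a user with N followers costs
# N writes, but reading a timeline is then a single fetch.

followers = {"aplusk": ["alice", "bob", "carol"]}  # hypothetical follower lists
timelines = {}  # follower -> list of (author, text), newest last

def tweet(author, text):
    for follower in followers.get(author, []):
        timelines.setdefault(follower, []).append((author, text))

tweet("aplusk", "hello")
# Each follower's timeline now holds a copy of the tweet.
```

The trade-off is exactly the one the thread describes: writes are amplified by follower count, but the read path (render a timeline) never has to join across millions of follow relationships.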
Best, alexander
On 2011-02-09, Scott Lystig Fritchie <fritc...@snookles.com> wrote:
Nathan Sobo <ns...@pivotallabs.com> wrote:
ns> Is a key-value store actually inappropriate for this problem?
No. One way to do it is to use a single KV key to store multiple
addresses' worth of info. Pick a relatively big number, say 50K
subscribers per key, though it may vary. Use a key naming scheme so
that you can pre-calculate all keys for a given list, e.g. bucket =
list-subscribers, key = list name + range index #, or perhaps list
name + start-of-hash-range + end-of-hash-range.
How do you know the range index # or the start & end of a range? One
method would be hashing - MD5 or SHA1 or whatever. If you store all
addresses for a list in a fixed number of hash hunks, e.g. 100, then
each hash hunk will hold roughly 20K entries for a 2M-subscriber
list. To find all subscribers, fetch the 100 known keys.
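As a rough sketch of that key-naming scheme (the list name, MD5, and the 100-hunk count are just the example numbers above, not anything Riak-specific):

```python
import hashlib

NUM_HUNKS = 100  # fixed number of hash hunks, per the example above

def hunk_key(list_name, address):
    """Deterministically map an address to one of NUM_HUNKS keys."""
    h = int(hashlib.md5(address.encode()).hexdigest(), 16)
    return "%s-%d" % (list_name, h % NUM_HUNKS)

def all_keys(list_name):
    """Every key for a list is known in advance; fetching these
    NUM_HUNKS keys retrieves the entire subscriber set."""
    return ["%s-%d" % (list_name, i) for i in range(NUM_HUNKS)]
```

Because `hunk_key` is deterministic, any node can compute where an address lives without consulting an index, and `all_keys` is the complete "fetch 100 known keys" scan.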
If you want to keep addresses in sorted order, it's more work but
also doable. A naive plan is to make your hash function F(addr) =
first letter of addr. Keys get clumpy that way, but a little more
creativity can get around it.
To find a particular subscriber, hash that subscriber's address and
fetch one key. You're also getting a lot of uninteresting data with
that key's value, but if the event is uncommon, it's not a problem.
(If that event actually is common, consider moving the
commonly-queried data elsewhere, or duplicate that info in another,
smaller key somewhere else.) Similar logic applies to list
maintenance events (add subscriber, delete subscriber).
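The single-fetch lookup and the maintenance operations might look like this sketch. The `kv` dict stands in for any key/value store (a Riak bucket, say); a real implementation would do a read-modify-write of the stored hunk value rather than mutate in place:

```python
import hashlib

NUM_HUNKS = 100  # same fixed hunk count as the example above

def hunk_key(list_name, address):
    """Map an address to the one hunk key that could contain it."""
    h = int(hashlib.md5(address.encode()).hexdigest(), 16)
    return "%s-%d" % (list_name, h % NUM_HUNKS)

kv = {}  # stand-in for the KV store; value = set of addresses in that hunk

def subscribe(list_name, address):
    # Read-modify-write of exactly one hunk.
    kv.setdefault(hunk_key(list_name, address), set()).add(address)

def unsubscribe(list_name, address):
    kv.get(hunk_key(list_name, address), set()).discard(address)

def is_subscriber(list_name, address):
    # One fetch: hash the address, read one hunk, check membership.
    return address in kv.get(hunk_key(list_name, address), set())
```

Every operation touches exactly one key, which is what keeps per-subscriber events cheap even though each value carries ~20K "uninteresting" neighbors.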
-Scott
_______________________________________________
riak-users mailing list
riak-users@lists.basho.com
http://lists.basho.com/mailman/listinfo/riak-users_lists.basho.com
--
Sent from my mobile device