The problem I'm working on is very similar to this. I'm building a
reputation system where we keep a fixed number of day buckets for the
scores. When new data comes in you need to work out which bucket it
belongs to, clear that bucket if you've rolled over to a new day (the
data in it would be at least n + 1 days old, where n is the number of
days you keep), and then store the value. In our case we're willing to
accept occasionally losing a bit of data, since the scores trend towards
a "good enough" value quite quickly. Still, it would be nice to know a
way to be certain it's safe to "nuke" a bucket, short of always blocking
writes on an "ALL" read. That can be very painful if you are replicating
data across datacentres.
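For what it's worth, the bucket rotation described above boils down to mapping a calendar day onto one of n rotating slots and resetting a slot the first time a new day lands in it. Here's a minimal in-memory sketch of that idea (the bucket count, the `record_score` name, and the tuple layout are all made up for illustration; in practice the buckets would live in the store itself):

```python
from datetime import date

N_DAYS = 7  # hypothetical retention window

def bucket_for(day: date, n_days: int = N_DAYS) -> int:
    """Map a calendar day onto one of n_days rotating buckets."""
    return day.toordinal() % n_days

# buckets[i] holds (day_ordinal, score_total); a stored day that differs
# from today means the bucket is at least n_days old and must be nuked
# before reuse.
buckets = [(0, 0.0)] * N_DAYS

def record_score(day: date, score: float) -> None:
    i = bucket_for(day)
    stored_day, total = buckets[i]
    if stored_day != day.toordinal():
        total = 0.0  # a new day landed in this slot: discard the old data
    buckets[i] = (day.toordinal(), total + score)
```

The race the paragraph above worries about is visible here: between reading the stale bucket and writing the reset value, a concurrent writer can slip in, which is exactly why "nuking" safely seems to want a blocking read.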
On 04/16/2010 11:50 AM, Chris Shorrock wrote:
I'm attempting to come up with a technique for limiting the number of
columns a single key (or super column - doesn't matter too much for
the context of this conversation) may contain at any one time. My
actual use-case is a little too meaty to try to describe so an
alternate use-case of this mechanism could be:
/Construct a twitter-esque feed which maintains a list of N tweets.
Tweets (in this system - and in reality, I suppose) occur at such a
rate that you want to limit a given user's "feed" to N items. You
do not have the ability to store an infinite number of tweets due
to the physical constraints of your hardware./
The "/my first idea/" answer is that when a tweet is inserted into the
feed of a given person, you then do a count and delete of any
outstanding tweets. In reality you could first count, then (if count
>= N) do a batch mutate covering both the insertion of the new entry
and the removal of the old. My issue with this approach is that after
a certain point every new entry into the system incurs the removal of
an old entry: once a feed has reached N, the count will always be
>= N on every subsequent query. Depending on how you index the tweets,
you may also need to do a read instead of a count to get the row
identifiers of the entries to remove.
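(To make the count-then-trim behaviour concrete, here is an in-memory sketch of the same logic; in Cassandra the append would be an insert and the trim a count/get_slice followed by a batch mutate. The cap `N` and the function name are invented for the example.)

```python
N = 5  # hypothetical feed cap

def insert_tweet(feed: list, tweet_id: str, n: int = N) -> None:
    """Count-then-trim: append the new entry, then drop the oldest
    entries so the feed never exceeds n items. Note that once the feed
    is full, every single insert also pays for a delete."""
    feed.append(tweet_id)
    if len(feed) > n:
        # drop everything older than the newest n entries
        del feed[: len(feed) - n]
```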
My second approach was to utilize a "slot" system, where a record
stored somewhere indicates the next slot for insertion. This can be
thought of as a fixed-length array where you store the next insertion
point in some other column family. When a new tweet occurs you
retrieve the current "slot" meta-data, insert into that index, then
update the meta-data to point at the next insertion point. My concerns
with this revolve around synchronization and losing entries due to
concurrent operations. I'd rather not have to use something like
ZooKeeper to synchronize across the application cluster.
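(A sketch of the slot scheme, reduced to a single process; the slot count and class name are made up. In Cassandra the slot indices would be column names and `next_slot` the separate meta-data record. The read-modify-write on `next_slot` is exactly the step that is unsafe under concurrent writers without external coordination, which is the concern raised above.)

```python
N_SLOTS = 5  # hypothetical fixed feed length

class SlotFeed:
    """Fixed-length 'slot' feed: a meta-data counter names the next
    insertion point, and each write overwrites the oldest entry in
    place, so no count or delete is ever needed."""

    def __init__(self, n: int = N_SLOTS):
        self.slots = [None] * n
        self.next_slot = 0  # the "meta-data" record

    def insert(self, tweet_id: str) -> None:
        # read meta-data, write into that slot, advance the pointer;
        # two concurrent inserts can read the same next_slot and one
        # entry silently vanishes -- the synchronization worry above.
        self.slots[self.next_slot] = tweet_id
        self.next_slot = (self.next_slot + 1) % len(self.slots)
```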
I have some other ideas but I'm mostly just spit-balling at this
point, so I thought I'd reach out to the collective intelligence of
the group to see if anyone has implemented something similar. Thanks
in advance.