Limiting by the number of columns in a row will perform very poorly. Limiting by how long a column has existed can perform quite well; Sylvain added that for 0.7 in https://issues.apache.org/jira/browse/CASSANDRA-699
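
To illustrate why a time limit is cheap, here is a rough in-memory sketch in Python. It is not Cassandra code and not the implementation from the ticket; the class `ExpiringRow`, its methods, and the `ttl_seconds` parameter are names made up for this example. The point it tries to show is that a time-bounded insert is a blind write: nothing has to be read or counted first, and expired columns are simply skipped on read and dropped later.

```python
import time


class ExpiringRow:
    """Toy model of a row whose columns each carry an expiry time (hypothetical)."""

    def __init__(self, clock=time.time):
        self._clock = clock
        self._columns = {}  # column name -> (value, expires_at)

    def insert(self, name, value, ttl_seconds):
        # Blind write: no count or read is needed to enforce the limit.
        self._columns[name] = (value, self._clock() + ttl_seconds)

    def live_columns(self):
        # Reads ignore anything whose expiry time has passed.
        now = self._clock()
        return {n: v for n, (v, exp) in self._columns.items() if exp > now}

    def purge(self):
        # Stand-in for background cleanup of expired columns.
        now = self._clock()
        self._columns = {n: c for n, c in self._columns.items() if c[1] > now}
```

A feed built this way is bounded by age rather than by an exact N, so it only approximates the "last N tweets" requirement for users who post at a roughly steady rate.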
On Fri, Apr 16, 2010 at 1:50 PM, Chris Shorrock <ch...@shorrockin.com> wrote:
> I'm attempting to come up with a technique for limiting the number of
> columns a single key (or super column - doesn't matter too much for the
> context of this conversation) may contain at any one time. My actual
> use-case is a little too meaty to try to describe so an alternate use-case
> of this mechanism could be:
>
> Construct a twitter-esque feed which maintains a list of N tweets. Tweets
> (in this system - and in reality I suppose) occur at such a rate that you
> want to limit a given user's "feed" to N items. You do not have the ability
> to store an infinite number of tweets due to the physical constraints of
> your hardware.
>
> The "my first idea" answer is that when a tweet is inserted into the feed
> of a given person, you then do a count and delete any outstanding tweets.
> In reality you could first count, then (if count >= N) do a batch mutate
> for the insertion of the new entry and the removal of the old. My issue
> with this approach is that after a certain point every new entry into the
> system will incur the removal of an old entry. The count, once a feed has
> reached N, will always be >= N on any subsequent queries. Depending on how
> you index the tweets you may need to actually do a read instead of a count
> to get the row identifiers.
>
> My second approach was to utilize a "slot" system where you have a record
> stored somewhere that indicates the next slot for insertion. This can be
> thought of as a fixed length array where you store the next insertion point
> in some other column family. When a new tweet occurs you retrieve the
> current "slot" meta-data, insert into that index, then update the meta-data
> for the next insertion. My concerns with this revolve around
> synchronization and losing entries due to concurrent operations. I'd rather
> not have to use something like ZooKeeper to synchronize in the application
> cluster.
>
> I have some other ideas but I'm mostly just spit-balling at this point. So
> I thought I'd reach out to the collective intelligence of the group to see
> if anyone has implemented something similar. Thanks in advance.
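
For comparison, the count-and-delete idea quoted above can be sketched the same way. Again this is a plain Python model, not Cassandra client code; `CountCappedFeed` and its `reads` and `deletes` counters are invented for illustration. It assumes columns are keyed by timestamp so the oldest entries sort first.

```python
class CountCappedFeed:
    """Toy model of a feed capped at `limit` columns by count (hypothetical)."""

    def __init__(self, limit):
        self.limit = limit
        self.columns = {}  # timestamp -> tweet
        self.reads = 0     # count/slice operations issued
        self.deletes = 0   # columns removed to stay under the cap

    def insert(self, timestamp, tweet):
        # Every insert has to look at the row first to learn its size.
        self.reads += 1
        overflow = len(self.columns) + 1 - self.limit
        if overflow > 0:
            # Finding the oldest entries is a read, not just a count.
            for ts in sorted(self.columns)[:overflow]:
                del self.columns[ts]
                self.deletes += 1
        self.columns[timestamp] = tweet
```

Once the feed is full, each insert in this model costs one read plus one delete on top of the write, which is the steady-state overhead described in the quoted message.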