I would assume that Facebook and Twitter do not keep all the data they store in Cassandra forever. I wonder how they are deleting old data from Cassandra...
Bill

On Mon, Mar 15, 2010 at 1:01 PM, Weijun Li <weiju...@gmail.com> wrote:
> OK, I will try to separate them out.
>
> On Sat, Mar 13, 2010 at 5:35 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>> You should submit your minor change to jira for others who might want to
>> try it.
>>
>> On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weiju...@gmail.com> wrote:
>> > Tried Sylvain's feature in 0.6 beta2 (it needed a minor change) and it
>> > worked perfectly. Without this feature, as long as you have a high
>> > volume of new and expired columns your life will be miserable :-)
>> >
>> > Thanks for the great job, Sylvain!!
>> >
>> > -Weijun
>> >
>> > On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylv...@yakaz.com>
>> > wrote:
>> >> I guess you can also vote for this ticket:
>> >> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>> >>
>> >> </advertising>
>> >>
>> >> --
>> >> Sylvain
>> >>
>> >> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
>> >> > On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
>> >> >> Let's take Twitter as an example. All the tweets are timestamped.
>> >> >> I want to keep only a month's worth of tweets for each user. The
>> >> >> number of tweets that fit within this one-month window varies from
>> >> >> user to user. What is the best way to accomplish this?
>> >> >
>> >> > This is the "expiry" problem that has been discussed on this list
>> >> > before. As far as I can see, there are no easy ways to do it with
>> >> > 0.5.
>> >> >
>> >> > If you use the ordered partitioner and make the first part of the
>> >> > keys a timestamp (or part of one), then you can fetch the keys and
>> >> > delete them.
>> >> >
>> >> > However, these deletes will be quite inefficient: currently each
>> >> > row must be deleted individually (there was a patch for range
>> >> > delete kicking around; I don't know whether it has been accepted
>> >> > yet).
>> >> >
>> >> > But even if range delete is implemented, it's still quite
>> >> > inefficient and not really what you want, and it doesn't work with
>> >> > the RandomPartitioner.
>> >> >
>> >> > If you have some metadata saying who tweeted within a given period
>> >> > (say 10 days or 30 days), and you store all the tweets in the same
>> >> > key per user per period (say with one column per tweet, or using
>> >> > supercolumns), then you can just delete one key per user per
>> >> > period.
>> >> >
>> >> > One of the problems with using a time-based key with the ordered
>> >> > partitioner is that you're always going to have a data imbalance,
>> >> > so you may want to try hashing *part* of the key (the first part)
>> >> > so you can still range scan the next part. This may fix load
>> >> > balancing while still letting you use range scans to do data
>> >> > expiry.
>> >> >
>> >> > e.g. your key is
>> >> >
>> >> >   hash of day number + user id + timestamp
>> >> >
>> >> > Then you can range scan an entire day's tweets to expire them, and
>> >> > range scan a given user's tweets for a given day efficiently (doing
>> >> > this for 30 days is just 30 range scans).
>> >> >
>> >> > Putting a hash in there fixes load balancing with OPP.
>> >> >
>> >> > Mark
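For anyone trying to picture the key layout Mark describes, here is a minimal
Java sketch of the construction. Everything beyond the "hash of day number +
user id + timestamp" idea is an assumption on my part: the MD5 prefix length,
the ':' separator, and the zero-padded millisecond timestamps are all
arbitrary choices, picked only so that keys sort predictably under the
order-preserving partitioner.

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Sketch of Mark's key scheme: hash(dayNumber) + userId + timestamp.
 * The hash prefix spreads days across the ring under OPP, while all
 * keys for one day stay contiguous and range-scannable.
 */
public class ExpiryKeys {

    /** Days since the Unix epoch, used as the expiry bucket. */
    static long dayNumber(long epochMillis) {
        return epochMillis / (24L * 60 * 60 * 1000);
    }

    /** Short hex hash of the day number (assumed: first 4 bytes of MD5). */
    static String dayHash(long day) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(
                    Long.toString(day).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                hex.append(String.format("%02x", digest[i]));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new AssertionError("MD5 is always available", e);
        }
    }

    /** Row key for one tweet: hash(day) + userId + zero-padded timestamp.
     *  Padding keeps lexicographic order equal to numeric order. */
    static String tweetKey(String userId, long epochMillis) {
        return dayHash(dayNumber(epochMillis)) + ":" + userId
             + ":" + String.format("%020d", epochMillis);
    }

    /** Start/end keys covering everything written on the given day,
     *  for one range scan followed by per-row deletes.
     *  ';' sorts immediately after ':', so the range is exact. */
    static String[] dayRange(long day) {
        String prefix = dayHash(day);
        return new String[] { prefix + ":", prefix + ";" };
    }
}

Expiring a day that has fallen out of the 30-day window would then be one
range scan over dayRange(day) (e.g. via get_range_slices) plus a delete per
returned key, and scanning one user's tweets for that day just narrows the
start key to dayHash(day) + ":" + userId.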