That's a strange assumption. Users typically don't like their data being deleted without a very good reason. "We didn't have enough room" is not a very good reason. :)
On Wed, Mar 17, 2010 at 9:03 PM, Bill Au <bill.w...@gmail.com> wrote:
> I would assume that Facebook and Twitter are not keeping all the data
> that they store in Cassandra forever. I wonder how they are deleting
> old data from Cassandra...
>
> Bill
>
> On Mon, Mar 15, 2010 at 1:01 PM, Weijun Li <weiju...@gmail.com> wrote:
>> OK, I will try to separate them out.
>>
>> On Sat, Mar 13, 2010 at 5:35 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>> You should submit your minor change to JIRA for others who might
>>> want to try it.
>>>
>>> On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weiju...@gmail.com> wrote:
>>>> Tried Sylvain's feature in 0.6 beta2 (it needed a minor change) and
>>>> it worked perfectly. Without this feature, your life will be
>>>> miserable as soon as you have a high volume of new and expired
>>>> columns :-)
>>>>
>>>> Thanks for the great job, Sylvain!!
>>>>
>>>> -Weijun
>>>>
>>>> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylv...@yakaz.com> wrote:
>>>>> I guess you can also vote for this ticket:
>>>>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>>>>>
>>>>> </advertising>
>>>>>
>>>>> --
>>>>> Sylvain
>>>>>
>>>>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
>>>>>> On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
>>>>>>> Let's take Twitter as an example. All the tweets are timestamped.
>>>>>>> I want to keep only a month's worth of tweets for each user. The
>>>>>>> number of tweets that fit within this one-month window varies
>>>>>>> from user to user. What is the best way to accomplish this?
>>>>>>
>>>>>> This is the "expiry" problem that has been discussed on this list
>>>>>> before. As far as I can see, there are no easy ways to do it
>>>>>> with 0.5.
>>>>>>
>>>>>> If you use the ordered partitioner and make the first part of the
>>>>>> keys a timestamp (or part of it), then you can get the keys and
>>>>>> delete them.
>>>>>>
>>>>>> However, these deletes will be quite inefficient: currently each
>>>>>> row must be deleted individually (there was a patch for range
>>>>>> delete kicking around; I don't know if it has been accepted yet).
>>>>>>
>>>>>> But even if range delete is implemented, it's still quite
>>>>>> inefficient, not really what you want, and it doesn't work with
>>>>>> the RandomPartitioner.
>>>>>>
>>>>>> If you have some metadata to say who tweeted within a given
>>>>>> period (say 10 days or 30 days) and you store the tweets all in
>>>>>> the same key per user per period (say with one column per tweet,
>>>>>> or use supercolumns), then you can just delete one key per user
>>>>>> per period.
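As a rough illustration of the per-user-per-period layout Mark
describes, here is a minimal Java sketch. The TweetStore facade is
hypothetical (a stand-in for whatever Thrift-era client you use), and
the "user:yyyy-MM" bucket format is likewise an illustrative
assumption, not anything Cassandra prescribes:

    import java.text.SimpleDateFormat;
    import java.util.Date;
    import java.util.TimeZone;

    // Sketch of "one row per user per period": all of a user's tweets
    // for a period live under a single row key, so expiry is one delete.
    public class PeriodBuckets {

        interface TweetStore {                   // hypothetical client facade
            void insert(String rowKey, long columnName, String value);
            void removeRow(String rowKey);       // one remove expires a whole period
        }

        static SimpleDateFormat monthFormat() {  // UTC month bucket, e.g. "2010-03"
            SimpleDateFormat f = new SimpleDateFormat("yyyy-MM");
            f.setTimeZone(TimeZone.getTimeZone("UTC"));
            return f;
        }

        // Row key = user id + period bucket, e.g. "alice:2010-03".
        static String rowKey(String userId, Date when) {
            return userId + ":" + monthFormat().format(when);
        }

        static void storeTweet(TweetStore store, String userId, Date when, String text) {
            // One column per tweet; the column name is the tweet's timestamp,
            // so a user's tweets within a period stay ordered.
            store.insert(rowKey(userId, when), when.getTime(), text);
        }

        static void expirePeriod(TweetStore store, String userId, Date oldPeriod) {
            // Expiry is a single row deletion per user per period.
            store.removeRow(rowKey(userId, oldPeriod));
        }

        public static void main(String[] args) {
            TweetStore printer = new TweetStore() { // stub that just prints
                public void insert(String k, long c, String v) { System.out.println("insert " + k + " / " + c); }
                public void removeRow(String k) { System.out.println("remove " + k); }
            };
            Date now = new Date();
            storeTweet(printer, "alice", now, "hello");
            expirePeriod(printer, "alice", new Date(now.getTime() - 90L * 86_400_000L));
        }
    }

The point of the layout is that expiry becomes one row removal per
user per period instead of one removal per tweet.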
>>>>>> One of the problems with using a time-based key with the ordered
>>>>>> partitioner is that you're always going to have a data imbalance,
>>>>>> so you may want to try hashing *part* of the key (the first part)
>>>>>> so you can still range scan the next part. This may fix load
>>>>>> balancing while still enabling you to use range scans to do data
>>>>>> expiry. E.g. your key is:
>>>>>>
>>>>>> hash of day number + user id + timestamp
>>>>>>
>>>>>> Then you can range scan the entire day's tweets to expire them,
>>>>>> and range scan a given user's tweets for a given day efficiently
>>>>>> (and doing this for 30 days is just 30 range scans).
>>>>>>
>>>>>> Putting a hash in there fixes load balancing with OPP.
>>>>>>
>>>>>> Mark
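Mark's composite key is easy to sketch in the same vein. Assuming an
MD5 hash over the day number for the prefix (the hash choice, the ":"
separators, and the zero-padding below are illustrative assumptions,
not anything Cassandra mandates):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    // Sketch of the composite key: hash of day number + user id + timestamp.
    public class DayHashKey {

        // Day number = days since the Unix epoch (UTC).
        static long dayNumber(long millis) {
            return millis / 86_400_000L;
        }

        // A short hash prefix spreads consecutive days across the ring
        // under the ordered partitioner, while all keys for the same day
        // stay contiguous and range-scannable.
        static String dayHash(long day) {
            try {
                MessageDigest md5 = MessageDigest.getInstance("MD5");
                byte[] d = md5.digest(Long.toString(day).getBytes(StandardCharsets.UTF_8));
                return String.format("%02x%02x%02x%02x", d[0], d[1], d[2], d[3]);
            } catch (NoSuchAlgorithmException e) {
                throw new AssertionError(e); // MD5 is always available
            }
        }

        // Full key: hash-of-day prefix, then user id, then zero-padded
        // timestamp, so one range scan covers a day, and a narrower one
        // covers a single user's day.
        static String tweetKey(String userId, long millis) {
            return dayHash(dayNumber(millis)) + ":" + userId + ":"
                    + String.format("%013d", millis);
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            // To expire a day, range-scan keys with this prefix and delete them.
            System.out.println("prefix for today:  " + dayHash(dayNumber(now)) + ":");
            System.out.println("key for one tweet: " + tweetKey("alice", now));
        }
    }

With the ordered partitioner, every key for a given day shares the
same dayHash prefix, so one range scan per day covers expiry (30 days
is 30 scans), while different days hash to scattered positions on the
ring and keep the load balanced.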