+1, I'd like to try this patch, but I'm running into an error:

    patch failed: src/java/org/apache/cassandra/utils/FBUtilities.java:342

Alternatively, someone could create a github fork which incorporates this patch?

Ryan

On Sat, Mar 13, 2010 at 3:36 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> since they are separate changes, it's much easier to review if they
> are submitted separately.
>
> On 3/13/10, Weijun Li <weiju...@gmail.com> wrote:
> > Sure. I'm making another change for cross multiple DC replication, once this
> > one is done (probably in next week) I'll submit them together to Jira. All
> > based on 0.6 beta2.
> >
> > -Weijun
> >
> > -----Original Message-----
> > From: Jonathan Ellis [mailto:jbel...@gmail.com]
> > Sent: Saturday, March 13, 2010 5:36 AM
> > To: cassandra-u...@incubator.apache.org
> > Subject: Re: question about deleting from cassandra
> >
> > You should submit your minor change to jira for others who might want to try
> > it.
> >
> > On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weiju...@gmail.com> wrote:
> >> Tried Sylvain's feature in 0.6 beta2 (need minor change) and it worked
> >> perfectly. Without this feature, as far as you have high volume new and
> >> expired columns your life will be miserable :-)
> >>
> >> Thanks for great job Sylvain!!
> >>
> >> -Weijun
> >>
> >> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylv...@yakaz.com>
> >> wrote:
> >>>
> >>> I guess you can also vote for this ticket :
> >>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
> >>>
> >>> </advertising>
> >>>
> >>> --
> >>> Sylvain
> >>>
> >>>
> >>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
> >>> > On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
> >>> >>
> >>> >> Let take Twitter as an example. All the tweets are timestamped. I want
> >>> >> to keep only a month's worth of tweets for each user. The number of tweets
> >>> >> that fit within this one month window varies from user to user. What is the
> >>> >> best way to accomplish this?
> >>> >
> >>> > This is the "expiry" problem that has been discussed on this list before. As
> >>> > far as I can see there are no easy ways to do it with 0.5
> >>> >
> >>> > If you use the ordered partitioner and make the first part of the keys a
> >>> > timestamp (or part of it) then you can get the keys and delete them.
> >>> >
> >>> > However, these deletes will be quite inefficient, currently each row must be
> >>> > deleted individually (there was a patch to range delete kicking around, I
> >>> > don't know if it's accepted yet)
> >>> >
> >>> > But even if range delete is implemented, it's still quite inefficient and
> >>> > not really what you want, and doesn't work with the RandomPartitioner
> >>> >
> >>> > If you have some metadata to say who tweeted within a given period (say 10
> >>> > days or 30 days) and you store the tweets all in the same key per user per
> >>> > period (say with one column per tweet, or use supercolumns), then you can
> >>> > just delete one key per user per period.
> >>> >
> >>> > One of the problems with using a time-based key with ordered partitioner is
> >>> > that you're always going to have a data imbalance, so you may want to try
> >>> > hashing *part* of the key (The first part) so you can still range scan the
> >>> > next part. This may fix load balancing while still enabling you to use range
> >>> > scans to do data expiry.
> >>> >
> >>> > e.g. your key is
> >>> >
> >>> > Hash of day number + user id + timestamp
> >>> >
> >>> > Then you can range scan the entire day's tweets to expire them, and range
> >>> > scan a given user's tweets for a given day efficiently (and doing this for
> >>> > 30 days is just 30 range scans)
> >>> >
> >>> > Putting a hash in there fixes load balancing with OPP.
> >>> >
> >>> > Mark
> >>> >
> >>
> >>
> >
> >
>
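
For anyone following along, the key layout Mark describes in the quoted message (hash of day number + user id + timestamp) could look roughly like the sketch below. Everything concrete in it is an assumption made for illustration: the helper names, the ':' separator, the 8-character MD5 day prefix, the zero-padded Unix timestamps, and the sorted Python list standing in for keys stored under an order-preserving partitioner. It does not call any Cassandra API, and it assumes user ids never contain ':'.

```python
import bisect
import hashlib

SECONDS_PER_DAY = 86400


def day_number(ts):
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day):
    """Hash only the day number so consecutive days scatter across the key space.

    Assumption: 8 hex characters of MD5 are enough for this illustration.
    """
    return hashlib.md5(str(day).encode("ascii")).hexdigest()[:8]


def tweet_key(user_id, ts):
    """Key = hash(day number) + user id + timestamp.

    The timestamp is zero-padded so string order matches numeric order.
    """
    return "%s:%s:%010d" % (day_prefix(day_number(ts)), user_id, ts)


def day_range(day):
    """Half-open [start, end) key range covering every tweet of one day."""
    p = day_prefix(day)
    return p + ":", p + ";"  # ';' is the ASCII character right after ':'


def user_day_range(user_id, day):
    """Half-open [start, end) key range covering one user's tweets for one day."""
    p = "%s:%s" % (day_prefix(day), user_id)
    return p + ":", p + ";"


def range_scan(sorted_keys, start, end):
    """Stand-in for a range scan over keys kept in sorted order."""
    lo = bisect.bisect_left(sorted_keys, start)
    hi = bisect.bisect_left(sorted_keys, end)
    return sorted_keys[lo:hi]


# Two days of sample tweets: (user id, timestamp in seconds)
sample = [("alice", 0), ("alice", 100), ("bob", 50), ("bobby", 60), ("alice", 86400)]
keys = sorted(tweet_key(u, t) for u, t in sample)

expired = range_scan(keys, *day_range(0))             # everything from day 0
bobs = range_scan(keys, *user_day_range("bob", 0))    # only bob's day-0 tweets
```

Expiring a day would then mean scanning `day_range(day)` and deleting whatever comes back, and reading a user's last 30 days would be 30 calls with `user_day_range`. Because only the day number is hashed, the user id and timestamp parts of the key still sort in a useful order within each day.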