The change to FBUtilities.java is quite simple (just add one method). If you search for ExpiringColumn in our mailing list you will find the thread to which Sylvain attached 3 patches for branch 0.5.0. That's where I started, and the patch worked successfully.
-Weijun

On Sun, Mar 14, 2010 at 6:29 AM, Ryan Daum <r...@thimbleware.com> wrote:
> +1, I'd like to try this patch but am running into an error: patch failed:
> src/java/org/apache/cassandra/utils/FBUtilities.java:342
>
> Alternatively, could someone create a github fork which incorporates this
> patch?
>
> Ryan
>
> On Sat, Mar 13, 2010 at 3:36 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> Since they are separate changes, it's much easier to review if they
>> are submitted separately.
>>
>> On 3/13/10, Weijun Li <weiju...@gmail.com> wrote:
>> > Sure. I'm making another change for cross-datacenter replication; once this
>> > one is done (probably next week) I'll submit them together to Jira, all
>> > based on 0.6 beta2.
>> >
>> > -Weijun
>> >
>> > -----Original Message-----
>> > From: Jonathan Ellis [mailto:jbel...@gmail.com]
>> > Sent: Saturday, March 13, 2010 5:36 AM
>> > To: cassandra-u...@incubator.apache.org
>> > Subject: Re: question about deleting from cassandra
>> >
>> > You should submit your minor change to Jira for others who might want to
>> > try it.
>> >
>> > On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weiju...@gmail.com> wrote:
>> >> Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it worked
>> >> perfectly. Without this feature, as long as you have a high volume of new
>> >> and expired columns your life will be miserable :-)
>> >>
>> >> Thanks for the great job, Sylvain!!
>> >>
>> >> -Weijun
>> >>
>> >> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylv...@yakaz.com> wrote:
>> >>>
>> >>> I guess you can also vote for this ticket:
>> >>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>> >>>
>> >>> </advertising>
>> >>>
>> >>> --
>> >>> Sylvain
>> >>>
>> >>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
>> >>> > On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
>> >>> >>
>> >>> >> Let's take Twitter as an example. All the tweets are timestamped. I want
>> >>> >> to keep only a month's worth of tweets for each user. The number of
>> >>> >> tweets that fit within this one-month window varies from user to user.
>> >>> >> What is the best way to accomplish this?
>> >>> >
>> >>> > This is the "expiry" problem that has been discussed on this list before.
>> >>> > As far as I can see there are no easy ways to do it with 0.5.
>> >>> >
>> >>> > If you use the ordered partitioner and make the first part of the keys a
>> >>> > timestamp (or part of it), then you can get the keys and delete them.
>> >>> >
>> >>> > However, these deletes will be quite inefficient; currently each row must
>> >>> > be deleted individually (there was a patch for range delete kicking
>> >>> > around, I don't know if it has been accepted yet).
>> >>> >
>> >>> > But even if range delete is implemented, it's still quite inefficient and
>> >>> > not really what you want, and it doesn't work with the RandomPartitioner.
>> >>> >
>> >>> > If you have some metadata to say who tweeted within a given period (say
>> >>> > 10 days or 30 days) and you store the tweets all in the same key per user
>> >>> > per period (say with one column per tweet, or use supercolumns), then you
>> >>> > can just delete one key per user per period.
>> >>> >
>> >>> > One of the problems with using a time-based key with ordered partitioner
>> >>> > is that you're always going to have a data imbalance, so you may want to
>> >>> > try hashing *part* of the key (the first part) so you can still range
>> >>> > scan the next part. This may fix load balancing while still enabling you
>> >>> > to use range scans to do data expiry.
>> >>> >
>> >>> > e.g. your key is
>> >>> >
>> >>> > Hash of day number + user id + timestamp
>> >>> >
>> >>> > Then you can range scan the entire day's tweets to expire them, and range
>> >>> > scan a given user's tweets for a given day efficiently (and doing this
>> >>> > for 30 days is just 30 range scans).
>> >>> >
>> >>> > Putting a hash in there fixes load balancing with OPP.
>> >>> >
>> >>> > Mark
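[Editor's note] For illustration, here is a minimal, self-contained Java sketch of the key scheme Mark describes (hash of day number + user id + timestamp). The MD5 prefix length, the ':' delimiter, the zero-padded timestamp, and the class/method names are assumptions made for the sketch, not part of any patch discussed in this thread; it only shows how such keys could be built and which key prefixes an expiry job would range scan.

import java.security.MessageDigest;

// Sketch of the key scheme above: hash(day number) + user id + timestamp.
public class TweetKeys
{
    private static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Day number since the epoch for a given timestamp.
    static long dayNumber(long timestampMillis)
    {
        return timestampMillis / DAY_MS;
    }

    // Short hex hash of the day number. Hashing the day is what moves each
    // day's contiguous key range to a different point on the ring, so load
    // spreads across days even with the order-preserving partitioner.
    static String dayHash(long day)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(Long.toString(day).getBytes("UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 4; i++) // 4 bytes is plenty for balancing
                sb.append(String.format("%02x", digest[i] & 0xff));
            return sb.toString();
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    // Row key for one tweet. Zero-padding the timestamp keeps lexical order
    // equal to time order, so a range scan over "dayHash:userId:" returns
    // that user's tweets for that day in time order.
    static String tweetKey(String userId, long timestampMillis)
    {
        return dayHash(dayNumber(timestampMillis)) + ":" + userId + ":"
               + String.format("%019d", timestampMillis);
    }

    // Key prefixes an expiry job would range scan, one per expired day:
    // everything older than retentionDays, going daysBack days further back.
    static String[] expiredDayPrefixes(long nowMillis, int retentionDays, int daysBack)
    {
        String[] prefixes = new String[daysBack];
        long firstExpiredDay = dayNumber(nowMillis) - retentionDays;
        for (int i = 0; i < daysBack; i++)
            prefixes[i] = dayHash(firstExpiredDay - i) + ":";
        return prefixes;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis();
        System.out.println(tweetKey("alice", now));
        for (String p : expiredDayPrefixes(now, 30, 3))
            System.out.println("expire keys starting with: " + p);
    }
}

With keys built this way, an expiry job would range scan each expired day's prefix and remove the returned rows one by one (as noted above, 0.5/0.6 have no range delete, so each row is still deleted individually); a per-user read for one day is a single range scan over that day's hash plus the user id.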