The change to FBUtilities.java is quite simple (just add one method). If you search for ExpiringColumn in our mailing list you will find the thread to which Sylvain attached 3 patches for branch 0.5.0. That's where I started, and the patch worked successfully.
-Weijun

On Sun, Mar 14, 2010 at 6:29 AM, Ryan Daum <r...@thimbleware.com> wrote:
> +1, I'd like to try this patch but am running into an error: patch failed:
> src/java/org/apache/cassandra/utils/FBUtilities.java:342
>
> Alternatively, could someone create a github fork which incorporates this
> patch?
>
> Ryan
>
> On Sat, Mar 13, 2010 at 3:36 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> Since they are separate changes, it's much easier to review if they
>> are submitted separately.
>>
>> On 3/13/10, Weijun Li <weiju...@gmail.com> wrote:
>> > Sure. I'm making another change for cross-datacenter replication; once this
>> > one is done (probably next week) I'll submit them together to Jira, all
>> > based on 0.6 beta2.
>> >
>> > -Weijun
>> >
>> > -----Original Message-----
>> > From: Jonathan Ellis [mailto:jbel...@gmail.com]
>> > Sent: Saturday, March 13, 2010 5:36 AM
>> > To: cassandra-u...@incubator.apache.org
>> > Subject: Re: question about deleting from cassandra
>> >
>> > You should submit your minor change to Jira for others who might want to
>> > try it.
>> >
>> > On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weiju...@gmail.com> wrote:
>> >> Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it worked
>> >> perfectly. Without this feature, as long as you have a high volume of new
>> >> and expired columns your life will be miserable :-)
>> >>
>> >> Thanks for the great job, Sylvain!!
>> >>
>> >> -Weijun
>> >>
>> >> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylv...@yakaz.com> wrote:
>> >>>
>> >>> I guess you can also vote for this ticket:
>> >>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>> >>>
>> >>> </advertising>
>> >>>
>> >>> --
>> >>> Sylvain
>> >>>
>> >>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
>> >>> > On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
>> >>> >>
>> >>> >> Let's take Twitter as an example. All the tweets are timestamped. I want
>> >>> >> to keep only a month's worth of tweets for each user. The number of
>> >>> >> tweets that fit within this one-month window varies from user to user.
>> >>> >> What is the best way to accomplish this?
>> >>> >
>> >>> > This is the "expiry" problem that has been discussed on this list before.
>> >>> > As far as I can see there are no easy ways to do it with 0.5.
>> >>> >
>> >>> > If you use the ordered partitioner and make the first part of the keys a
>> >>> > timestamp (or part of it), then you can get the keys and delete them.
>> >>> >
>> >>> > However, these deletes will be quite inefficient; currently each row must
>> >>> > be deleted individually (there was a patch for range delete kicking
>> >>> > around, I don't know if it has been accepted yet).
>> >>> >
>> >>> > But even if range delete is implemented, it's still quite inefficient and
>> >>> > not really what you want, and it doesn't work with the RandomPartitioner.
>> >>> >
>> >>> > If you have some metadata to say who tweeted within a given period (say
>> >>> > 10 days or 30 days) and you store the tweets all in the same key per user
>> >>> > per period (say with one column per tweet, or use supercolumns), then you
>> >>> > can just delete one key per user per period.
>> >>> >
>> >>> > One of the problems with using a time-based key with ordered partitioner
>> >>> > is that you're always going to have a data imbalance, so you may want to
>> >>> > try hashing *part* of the key (the first part) so you can still range
>> >>> > scan the next part. This may fix load balancing while still enabling you
>> >>> > to use range scans to do data expiry.
>> >>> >
>> >>> > e.g. your key is
>> >>> >
>> >>> > Hash of day number + user id + timestamp
>> >>> >
>> >>> > Then you can range scan the entire day's tweets to expire them, and range
>> >>> > scan a given user's tweets for a given day efficiently (and doing this
>> >>> > for 30 days is just 30 range scans).
>> >>> >
>> >>> > Putting a hash in there fixes load balancing with OPP.
>> >>> >
>> >>> > Mark
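[Editor's note] For illustration, here is a minimal, self-contained Java sketch of the key scheme Mark describes (hash of day number + user id + timestamp). The MD5 prefix length, the ':' delimiter, the zero-padded timestamp, and the class/method names are assumptions made for the sketch, not part of any patch discussed in this thread; it only shows how such keys could be built and which key prefixes an expiry job would range scan.

import java.security.MessageDigest;

// Sketch of the key scheme above: hash(day number) + user id + timestamp.
public class TweetKeys
{
    private static final long DAY_MS = 24L * 60 * 60 * 1000;

    // Day number since the epoch for a given timestamp.
    static long dayNumber(long timestampMillis)
    {
        return timestampMillis / DAY_MS;
    }

    // Short hex hash of the day number. Hashing the day is what moves each
    // day's contiguous key range to a different point on the ring, so load
    // spreads across days even with the order-preserving partitioner.
    static String dayHash(long day)
    {
        try
        {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(Long.toString(day).getBytes("UTF-8"));
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 4; i++) // 4 bytes is plenty for balancing
                sb.append(String.format("%02x", digest[i] & 0xff));
            return sb.toString();
        }
        catch (Exception e)
        {
            throw new RuntimeException(e);
        }
    }

    // Row key for one tweet. Zero-padding the timestamp keeps lexical order
    // equal to time order, so a range scan over "dayHash:userId:" returns
    // that user's tweets for that day in time order.
    static String tweetKey(String userId, long timestampMillis)
    {
        return dayHash(dayNumber(timestampMillis)) + ":" + userId + ":"
               + String.format("%019d", timestampMillis);
    }

    // Key prefixes an expiry job would range scan, one per expired day:
    // everything older than retentionDays, going daysBack days further back.
    static String[] expiredDayPrefixes(long nowMillis, int retentionDays, int daysBack)
    {
        String[] prefixes = new String[daysBack];
        long firstExpiredDay = dayNumber(nowMillis) - retentionDays;
        for (int i = 0; i < daysBack; i++)
            prefixes[i] = dayHash(firstExpiredDay - i) + ":";
        return prefixes;
    }

    public static void main(String[] args)
    {
        long now = System.currentTimeMillis();
        System.out.println(tweetKey("alice", now));
        for (String p : expiredDayPrefixes(now, 30, 3))
            System.out.println("expire keys starting with: " + p);
    }
}

With keys built this way, an expiry job would range scan each expired day's prefix and remove the returned rows one by one (as noted above, 0.5/0.6 have no range delete, so each row is still deleted individually); a per-user read for one day is a single range scan over that day's hash plus the user id.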