Re: question about deleting from cassandra

Ryan Daum Sun, 14 Mar 2010 07:30:03 -0700

+1. I'd like to try this patch, but applying it fails with:
error: patch failed: src/java/org/apache/cassandra/utils/FBUtilities.java:342


Alternatively, could someone create a GitHub fork that incorporates this
patch?

Ryan

On Sat, Mar 13, 2010 at 3:36 PM, Jonathan Ellis <[email protected]> wrote:

> Since they are separate changes, it's much easier to review them if they
> are submitted separately.
>
> On 3/13/10, Weijun Li <[email protected]> wrote:
> > Sure. I'm making another change for replication across multiple DCs; once
> > this one is done (probably next week) I'll submit them together to Jira,
> > all based on 0.6 beta2.
> >
> > -Weijun
> >
> > -----Original Message-----
> > From: Jonathan Ellis [mailto:[email protected]]
> > Sent: Saturday, March 13, 2010 5:36 AM
> > To: [email protected]
> > Subject: Re: question about deleting from cassandra
> >
> > You should submit your minor change to Jira for others who might want to
> > try it.
> >
> > On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <[email protected]> wrote:
> >> Tried Sylvain's feature in 0.6 beta2 (it needed a minor change) and it
> >> worked perfectly. Without this feature, as long as you have a high volume
> >> of new and expired columns, your life will be miserable :-)
> >>
> >> Thanks for the great job, Sylvain!!
> >>
> >> -Weijun
> >>
> >> On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <[email protected]>
> >> wrote:
> >>>
> >>> I guess you can also vote for this ticket :
> >>> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
> >>>
> >>> </advertising>
> >>>
> >>> --
> >>> Sylvain
> >>>
> >>>
> >>> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <[email protected]> wrote:
> >>> > On 12 March 2010 03:34, Bill Au <[email protected]> wrote:
> >>> >>
> >>> >> Let's take Twitter as an example. All the tweets are timestamped. I
> >>> >> want to keep only a month's worth of tweets for each user. The number
> >>> >> of tweets that fit within this one-month window varies from user to
> >>> >> user. What is the best way to accomplish this?
> >>> >
> >>> > This is the "expiry" problem that has been discussed on this list
> >>> > before. As far as I can see, there is no easy way to do it with 0.5.
> >>> >
> >>> > If you use the ordered partitioner and make the first part of the
> >>> > keys a timestamp (or part of it), then you can get the keys and delete
> >>> > them.
> >>> >
> >>> > However, these deletes will be quite inefficient: currently each row
> >>> > must be deleted individually (there was a patch for range deletes
> >>> > kicking around; I don't know if it's been accepted yet).
> >>> >
> >>> > But even if range delete is implemented, it's still quite inefficient
> >>> > and not really what you want, and it doesn't work with the
> >>> > RandomPartitioner.
> >>> >
> >>> > If you have some metadata to say who tweeted within a given period
> >>> > (say 10 days or 30 days) and you store the tweets all in the same key
> >>> > per user per period (say with one column per tweet, or use
> >>> > supercolumns), then you can just delete one key per user per period.
> >>> >
> >>> > One of the problems with using a time-based key with the ordered
> >>> > partitioner is that you're always going to have a data imbalance, so
> >>> > you may want to try hashing *part* of the key (the first part) so you
> >>> > can still range scan the next part. This may fix load balancing while
> >>> > still enabling you to use range scans to do data expiry.
> >>> >
> >>> > e.g. your key is
> >>> >
> >>> > hash(day number) + user id + timestamp
> >>> >
> >>> > Then you can range scan the entire day's tweets to expire them, and
> >>> > range scan a given user's tweets for a given day efficiently (and
> >>> > doing this for 30 days is just 30 range scans).
> >>> >
> >>> > Putting a hash in there fixes load balancing with OPP.
> >>> >
> >>> > Mark
> >>> >
> >>
> >>
> >
> >
>
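
For anyone following along, here is a minimal sketch of the key layout Mark
describes above (hash of the day number, then user id, then timestamp). It is
an illustration only, not code from Cassandra or from the patch. The function
names, the ":" separator, the use of MD5, and the zero-padded timestamp are all
assumptions made for the example; it also assumes user ids never contain the
separator and that the partitioner orders keys lexicographically.

```python
import hashlib
from typing import Tuple

SECONDS_PER_DAY = 86400
SEP = ":"                         # hypothetical separator; must not occur in user ids
AFTER_SEP = chr(ord(SEP) + 1)     # the next character after SEP, used as an exclusive bound


def day_number(ts: int) -> int:
    """Days since the Unix epoch for a timestamp given in seconds."""
    return ts // SECONDS_PER_DAY


def day_prefix(day: int) -> str:
    """Hash only the day number, so consecutive days land far apart in key order."""
    return hashlib.md5(str(day).encode("ascii")).hexdigest()


def tweet_key(user_id: str, ts: int) -> str:
    """Build hash(day) + user id + timestamp; the timestamp is zero-padded to sort correctly."""
    return SEP.join([day_prefix(day_number(ts)), user_id, "%010d" % ts])


def day_range(day: int) -> Tuple[str, str]:
    """[start, end) bounds covering every key written on the given day."""
    prefix = day_prefix(day)
    return prefix + SEP, prefix + AFTER_SEP


def user_day_range(user_id: str, day: int) -> Tuple[str, str]:
    """[start, end) bounds covering one user's keys on the given day."""
    prefix = day_prefix(day) + SEP + user_id
    return prefix + SEP, prefix + AFTER_SEP
```

Expiring a day would then be one range scan over day_range(day) followed by
deletes of the returned keys, and reading a user's last 30 days would be 30
calls to user_day_range, one per day.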
