That is very true from the users' point of view, especially since their data is being stored for free. But I am looking at it from the service providers' point of view. Maybe that's why NoSQL solutions are so popular right now, since they scale much better than an RDBMS. I wonder if service providers just keep adding more and more machines as the number of users and the amount of data grow. In theory there is a breaking point somewhere, right?
Bill

On Wed, Mar 17, 2010 at 10:28 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> That's a strange assumption. Users typically don't like their data
> being deleted without a very good reason. "We didn't have enough
> room" is not a very good reason. :)
>
> On Wed, Mar 17, 2010 at 9:03 PM, Bill Au <bill.w...@gmail.com> wrote:
> > I would assume that Facebook and Twitter are not keeping all the data
> > that they store in Cassandra forever. I wonder how they are deleting
> > old data from Cassandra...
> >
> > Bill
> >
> > On Mon, Mar 15, 2010 at 1:01 PM, Weijun Li <weiju...@gmail.com> wrote:
> >>
> >> OK, I will try to separate them out.
> >>
> >> On Sat, Mar 13, 2010 at 5:35 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>>
> >>> You should submit your minor change to JIRA for others who might want
> >>> to try it.
> >>>
> >>> On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weiju...@gmail.com> wrote:
> >>> > Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it
> >>> > worked perfectly. Without this feature, as long as you have a high
> >>> > volume of new and expired columns your life will be miserable :-)
> >>> >
> >>> > Thanks for the great job, Sylvain!!
> >>> >
> >>> > -Weijun
> >>> >
> >>> > On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylv...@yakaz.com> wrote:
> >>> >>
> >>> >> I guess you can also vote for this ticket:
> >>> >> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
> >>> >>
> >>> >> </advertising>
> >>> >>
> >>> >> --
> >>> >> Sylvain
> >>> >>
> >>> >> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
> >>> >> > On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
> >>> >> >>
> >>> >> >> Let's take Twitter as an example. All the tweets are timestamped.
> >>> >> >> I want to keep only a month's worth of tweets for each user. The
> >>> >> >> number of tweets that fit within this one-month window varies from
> >>> >> >> user to user. What is the best way to accomplish this?
> >>> >> >
> >>> >> > This is the "expiry" problem that has been discussed on this list
> >>> >> > before. As far as I can see there are no easy ways to do it with 0.5.
> >>> >> >
> >>> >> > If you use the ordered partitioner and make the first part of the
> >>> >> > keys a timestamp (or part of it), then you can get the keys and
> >>> >> > delete them.
> >>> >> >
> >>> >> > However, these deletes will be quite inefficient; currently each row
> >>> >> > must be deleted individually (there was a patch for range delete
> >>> >> > kicking around, I don't know if it has been accepted yet).
> >>> >> >
> >>> >> > But even if range delete is implemented, it's still quite
> >>> >> > inefficient and not really what you want, and it doesn't work with
> >>> >> > the RandomPartitioner.
> >>> >> >
> >>> >> > If you have some metadata saying who tweeted within a given period
> >>> >> > (say 10 days or 30 days) and you store the tweets all in the same
> >>> >> > key per user per period (say with one column per tweet, or use
> >>> >> > supercolumns), then you can just delete one key per user per period.
> >>> >> >
> >>> >> > One of the problems with using a time-based key with the ordered
> >>> >> > partitioner is that you're always going to have a data imbalance,
> >>> >> > so you may want to try hashing *part* of the key (the first part)
> >>> >> > so you can still range scan the next part. This may fix load
> >>> >> > balancing while still enabling you to use range scans to do data
> >>> >> > expiry.
> >>> >> >
> >>> >> > e.g. your key is
> >>> >> >
> >>> >> > hash of day number + user id + timestamp
> >>> >> >
> >>> >> > Then you can range scan the entire day's tweets to expire them, and
> >>> >> > range scan a given user's tweets for a given day efficiently (and
> >>> >> > doing this for 30 days is just 30 range scans).
> >>> >> >
> >>> >> > Putting a hash in there fixes load balancing with OPP.
> >>> >> >
> >>> >> > Mark
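
To make Mark's "delete one key per user per period" suggestion concrete, here is a minimal Java sketch of how such a bucket key could be built. The names (PeriodBuckets, PERIOD_DAYS, bucketKey) are illustrative, not from any Cassandra API; the point is only that every tweet a user writes in the same retention window lands under one row key, so expiring the window is a single row deletion per user instead of one delete per tweet.

    // Sketch only: shows the key scheme, not the Cassandra client calls.
    import java.util.concurrent.TimeUnit;

    public class PeriodBuckets {
        static final int PERIOD_DAYS = 30; // hypothetical retention window

        // Row key for all of a user's tweets in one retention period.
        static String bucketKey(String userId, long timestampMillis) {
            long period = TimeUnit.MILLISECONDS.toDays(timestampMillis) / PERIOD_DAYS;
            return userId + ":" + period;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            // Tweets from the same user in the same 30-day window share a key,
            // so removing that one row drops the whole window at once.
            System.out.println(bucketKey("alice", now));
            System.out.println(bucketKey("alice", now - TimeUnit.DAYS.toMillis(40)));
        }
    }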
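
And a sketch of the hash-prefixed layout from Mark's "e.g." (hash of day number + user id + timestamp), again with assumed names and formats rather than anything from the Cassandra API. A fixed-width hash of the day number spreads each day's rows around the ring under the order-preserving partitioner, while all keys for one day still sort together, so expiring a day is one range scan over that day's prefix.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.concurrent.TimeUnit;

    public class HashedDayKey {
        // Fixed-width hex hash of the day number, so lexicographic key
        // order groups a whole day under one contiguous range.
        static String dayPrefix(long dayNumber) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(Long.toString(dayNumber).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) hex.append(String.format("%02x", digest[i] & 0xff));
            return hex.toString();
        }

        static String tweetKey(String userId, long timestampMillis)
                throws NoSuchAlgorithmException {
            long day = TimeUnit.MILLISECONDS.toDays(timestampMillis);
            // Zero-pad the timestamp so string order matches numeric order.
            return dayPrefix(day) + ":" + userId + ":"
                    + String.format("%013d", timestampMillis);
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30);
            long expiredDay = TimeUnit.MILLISECONDS.toDays(cutoff);
            // Range bounds covering every key written on that day; with OPP,
            // a range scan over [start, end) returns them for deletion.
            String start = dayPrefix(expiredDay) + ":";
            String end = dayPrefix(expiredDay) + ";"; // ';' sorts just after ':'
            System.out.println("expire keys in [" + start + ", " + end + ")");
        }
    }

Scanning a single user's tweets for a given day works the same way, just with the prefix extended to dayPrefix + ":" + userId + ":", which is why 30 days of one user's history costs 30 range scans in Mark's description.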