That's a strange assumption.  Users typically don't like their data
being deleted without a very good reason.  "We didn't have enough
room" is not a very good reason. :)

On Wed, Mar 17, 2010 at 9:03 PM, Bill Au <bill.w...@gmail.com> wrote:
> I would assume that Facebook and Twitter are not keeping all the data that
> they store in Cassandra forever.  I wonder how they are deleting old data
> from Cassandra...
> Bill
>
> On Mon, Mar 15, 2010 at 1:01 PM, Weijun Li <weiju...@gmail.com> wrote:
>>
>> OK I will try to separate them out.
>>
>> On Sat, Mar 13, 2010 at 5:35 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>
>>> You should submit your minor change to jira for others who might want to
>>> try it.
>>>
>>> On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weiju...@gmail.com> wrote:
>>> > Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it worked
>>> > perfectly. Without this feature, as long as you have a high volume of new
>>> > and expiring columns your life will be miserable :-)
>>> >
>>> > Thanks for the great job, Sylvain!!
>>> >
>>> > -Weijun
>>> >
>>> > On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylv...@yakaz.com>
>>> > wrote:
>>> >>
>>> >> I guess you can also vote for this ticket:
>>> >> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
>>> >>
>>> >> </advertising>
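>>> >>
>>> >> For anyone curious what that ticket buys you, here's a minimal sketch
>>> >> of an expiring insert, assuming the 0.7-style Thrift API where Column
>>> >> carries an optional ttl (in seconds); the keyspace and column family
>>> >> names below are made up:
>>> >>
>>> >>   import java.nio.ByteBuffer;
>>> >>   import org.apache.cassandra.thrift.*;
>>> >>   import org.apache.thrift.protocol.TBinaryProtocol;
>>> >>   import org.apache.thrift.transport.TFramedTransport;
>>> >>   import org.apache.thrift.transport.TSocket;
>>> >>
>>> >>   public class TtlInsert {
>>> >>       public static void main(String[] args) throws Exception {
>>> >>           TFramedTransport transport =
>>> >>                   new TFramedTransport(new TSocket("localhost", 9160));
>>> >>           Cassandra.Client client =
>>> >>                   new Cassandra.Client(new TBinaryProtocol(transport));
>>> >>           transport.open();
>>> >>           client.set_keyspace("Twitter");
>>> >>
>>> >>           Column col = new Column();
>>> >>           col.setName(ByteBuffer.wrap("body".getBytes("UTF-8")));
>>> >>           col.setValue(ByteBuffer.wrap("hello world".getBytes("UTF-8")));
>>> >>           col.setTimestamp(System.currentTimeMillis() * 1000);
>>> >>           // The point of the ticket: the column silently disappears
>>> >>           // after ~30 days, no explicit delete pass needed.
>>> >>           col.setTtl(30 * 24 * 3600);
>>> >>
>>> >>           client.insert(ByteBuffer.wrap("user42".getBytes("UTF-8")),
>>> >>                         new ColumnParent("Tweets"), col,
>>> >>                         ConsistencyLevel.QUORUM);
>>> >>           transport.close();
>>> >>       }
>>> >>   }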
>>> >>
>>> >> --
>>> >> Sylvain
>>> >>
>>> >>
>>> >> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
>>> >> > On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
>>> >> >>
>>> >> >> Let's take Twitter as an example.  All the tweets are timestamped.
>>> >> >> I want to keep only a month's worth of tweets for each user.  The
>>> >> >> number of tweets that fit within this one-month window varies from
>>> >> >> user to user.  What is the best way to accomplish this?
>>> >> >
>>> >> > This is the "expiry" problem that has been discussed on this list
>>> >> > before. As far as I can see there are no easy ways to do it with 0.5.
>>> >> >
>>> >> > If you use the ordered partitioner and make the first part of the
>>> >> > keys a timestamp (or part of it) then you can get the keys and
>>> >> > delete them.
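>>> >> >
>>> >> > For example (a sketch against the 0.6-style Thrift API from Java;
>>> >> > exact signatures vary by release, and the keyspace/CF names and key
>>> >> > layout are invented), with keys shaped "<yyyymmdd>:<userid>:<micros>"
>>> >> > and an open Cassandra.Client:
>>> >> >
>>> >> >   // Scan one day's worth of keys, then tombstone each row.
>>> >> >   // (Pagination beyond the first 1000 rows omitted for brevity.)
>>> >> >   KeyRange day = new KeyRange().setCount(1000);
>>> >> >   day.setStart_key("20100312");
>>> >> >   day.setEnd_key("20100313");
>>> >> >
>>> >> >   // Only the keys are needed, so fetch at most one column per row.
>>> >> >   SlicePredicate keysOnly = new SlicePredicate().setSlice_range(
>>> >> >           new SliceRange(new byte[0], new byte[0], false, 1));
>>> >> >
>>> >> >   for (KeySlice row : client.get_range_slices("Twitter",
>>> >> >           new ColumnParent("Tweets"), keysOnly, day,
>>> >> >           ConsistencyLevel.QUORUM)) {
>>> >> >       client.remove("Twitter", row.key, new ColumnPath("Tweets"),
>>> >> >               System.currentTimeMillis() * 1000,
>>> >> >               ConsistencyLevel.QUORUM);
>>> >> >   }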
>>> >> >
>>> >> > However, these deletes will be quite inefficient; currently each
>>> >> > row must be deleted individually (there was a patch for range
>>> >> > delete kicking around, I don't know if it's been accepted yet).
>>> >> >
>>> >> > But even if range delete is implemented, it's still quite
>>> >> > inefficient and not really what you want, and it doesn't work with
>>> >> > the RandomPartitioner.
>>> >> >
>>> >> > If you have some metadata to say who tweeted within a given period
>>> >> > (say 10 days or 30 days) and you store the tweets all in the same
>>> >> > key per user per period (say with one column per tweet, or use
>>> >> > supercolumns), then you can just delete one key per user per
>>> >> > period.
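>>> >> >
>>> >> > i.e. something like this (same 0.6-style API, names again made up):
>>> >> >
>>> >> >   // One row per user per period, e.g. key "user42:2010-03", with
>>> >> >   // one column per tweet. Expiring a period is then one row-level
>>> >> >   // remove per user instead of one delete per tweet.
>>> >> >   String rowKey = "user42" + ":" + "2010-03";
>>> >> >   client.remove("Twitter", rowKey, new ColumnPath("TweetsByPeriod"),
>>> >> >           System.currentTimeMillis() * 1000, ConsistencyLevel.QUORUM);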
>>> >> >
>>> >> > One of the problems with using a time-based key with the ordered
>>> >> > partitioner is that you're always going to have a data imbalance, so
>>> >> > you may want to try hashing *part* of the key (the first part) so you
>>> >> > can still range scan the next part. This may fix load balancing while
>>> >> > still enabling you to use range scans to do data expiry.
>>> >> >
>>> >> > e.g. your key is
>>> >> >
>>> >> > Hash of day number + user id + timestamp
>>> >> >
>>> >> > Then you can range scan the entire day's tweets to expire them, and
>>> >> > range scan a given user's tweets for a given day efficiently (and
>>> >> > doing this for 30 days is just 30 range scans).
>>> >> >
>>> >> > Putting a hash in there fixes load balancing with OPP.
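>>> >> >
>>> >> > A helper along those lines (MD5 is just an example; any stable hash
>>> >> > works, and the exact layout is up to you):
>>> >> >
>>> >> >   import java.security.MessageDigest;
>>> >> >
>>> >> >   // Hash only the day number: OPP spreads the days around the
>>> >> >   // ring, but everything within one day stays a contiguous range.
>>> >> >   static String tweetKey(long day, String userId, long micros)
>>> >> >           throws Exception {
>>> >> >       byte[] md5 = MessageDigest.getInstance("MD5")
>>> >> >               .digest(Long.toString(day).getBytes("UTF-8"));
>>> >> >       StringBuilder prefix = new StringBuilder();
>>> >> >       for (int i = 0; i < 4; i++)
>>> >> >           prefix.append(String.format("%02x", md5[i]));
>>> >> >       return prefix + ":" + userId + ":" + micros;
>>> >> >   }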
>>> >> >
>>> >> > Mark
>>> >> >
>>> >
>>> >
>>
>
>
