That is very true from the users' point of view, especially since their data is being stored for free. But I am looking at it from the service providers' point of view. Maybe that's why NoSQL solutions are so popular right now, since they scale much better than an RDBMS. I wonder if service providers just keep adding more and more machines as the number of users and the amount of data grow. In theory there is a breaking point somewhere, right?
Bill

On Wed, Mar 17, 2010 at 10:28 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> That's a strange assumption. Users typically don't like their data
> being deleted without a very good reason. "We didn't have enough
> room" is not a very good reason. :)
>
> On Wed, Mar 17, 2010 at 9:03 PM, Bill Au <bill.w...@gmail.com> wrote:
> > I would assume that Facebook and Twitter are not keeping all the data
> > that they store in Cassandra forever. I wonder how they are deleting
> > old data from Cassandra...
> >
> > Bill
> >
> > On Mon, Mar 15, 2010 at 1:01 PM, Weijun Li <weiju...@gmail.com> wrote:
> >>
> >> OK, I will try to separate them out.
> >>
> >> On Sat, Mar 13, 2010 at 5:35 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>>
> >>> You should submit your minor change to JIRA for others who might want
> >>> to try it.
> >>>
> >>> On Sat, Mar 13, 2010 at 3:18 AM, Weijun Li <weiju...@gmail.com> wrote:
> >>> > Tried Sylvain's feature in 0.6 beta2 (needs a minor change) and it
> >>> > worked perfectly. Without this feature, as long as you have a high
> >>> > volume of new and expired columns your life will be miserable :-)
> >>> >
> >>> > Thanks for the great job, Sylvain!!
> >>> >
> >>> > -Weijun
> >>> >
> >>> > On Fri, Mar 12, 2010 at 12:27 AM, Sylvain Lebresne <sylv...@yakaz.com> wrote:
> >>> >>
> >>> >> I guess you can also vote for this ticket:
> >>> >> https://issues.apache.org/jira/browse/CASSANDRA-699 :)
> >>> >>
> >>> >> </advertising>
> >>> >>
> >>> >> --
> >>> >> Sylvain
> >>> >>
> >>> >> On Fri, Mar 12, 2010 at 8:28 AM, Mark Robson <mar...@gmail.com> wrote:
> >>> >> > On 12 March 2010 03:34, Bill Au <bill.w...@gmail.com> wrote:
> >>> >> >>
> >>> >> >> Let's take Twitter as an example. All the tweets are timestamped.
> >>> >> >> I want to keep only a month's worth of tweets for each user. The
> >>> >> >> number of tweets that fit within this one-month window varies from
> >>> >> >> user to user. What is the best way to accomplish this?
> >>> >> >
> >>> >> > This is the "expiry" problem that has been discussed on this list
> >>> >> > before. As far as I can see there are no easy ways to do it with 0.5.
> >>> >> >
> >>> >> > If you use the ordered partitioner and make the first part of the
> >>> >> > keys a timestamp (or part of it), then you can get the keys and
> >>> >> > delete them.
> >>> >> >
> >>> >> > However, these deletes will be quite inefficient; currently each row
> >>> >> > must be deleted individually (there was a patch for range delete
> >>> >> > kicking around, I don't know if it has been accepted yet).
> >>> >> >
> >>> >> > But even if range delete is implemented, it's still quite
> >>> >> > inefficient and not really what you want, and it doesn't work with
> >>> >> > the RandomPartitioner.
> >>> >> >
> >>> >> > If you have some metadata saying who tweeted within a given period
> >>> >> > (say 10 days or 30 days) and you store the tweets all in the same
> >>> >> > key per user per period (say with one column per tweet, or use
> >>> >> > supercolumns), then you can just delete one key per user per period.
> >>> >> >
> >>> >> > One of the problems with using a time-based key with the ordered
> >>> >> > partitioner is that you're always going to have a data imbalance,
> >>> >> > so you may want to try hashing *part* of the key (the first part)
> >>> >> > so you can still range scan the next part. This may fix load
> >>> >> > balancing while still enabling you to use range scans to do data
> >>> >> > expiry.
> >>> >> >
> >>> >> > e.g. your key is
> >>> >> >
> >>> >> > hash of day number + user id + timestamp
> >>> >> >
> >>> >> > Then you can range scan the entire day's tweets to expire them, and
> >>> >> > range scan a given user's tweets for a given day efficiently (and
> >>> >> > doing this for 30 days is just 30 range scans).
> >>> >> >
> >>> >> > Putting a hash in there fixes load balancing with OPP.
> >>> >> >
> >>> >> > Mark
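
To make Mark's "delete one key per user per period" suggestion concrete, here is a minimal Java sketch of how such a bucket key could be built. The names (PeriodBuckets, PERIOD_DAYS, bucketKey) are illustrative, not from any Cassandra API; the point is only that every tweet a user writes in the same retention window lands under one row key, so expiring the window is a single row deletion per user instead of one delete per tweet.

    // Sketch only: shows the key scheme, not the Cassandra client calls.
    import java.util.concurrent.TimeUnit;

    public class PeriodBuckets {
        static final int PERIOD_DAYS = 30; // hypothetical retention window

        // Row key for all of a user's tweets in one retention period.
        static String bucketKey(String userId, long timestampMillis) {
            long period = TimeUnit.MILLISECONDS.toDays(timestampMillis) / PERIOD_DAYS;
            return userId + ":" + period;
        }

        public static void main(String[] args) {
            long now = System.currentTimeMillis();
            // Tweets from the same user in the same 30-day window share a key,
            // so removing that one row drops the whole window at once.
            System.out.println(bucketKey("alice", now));
            System.out.println(bucketKey("alice", now - TimeUnit.DAYS.toMillis(40)));
        }
    }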
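
And a sketch of the hash-prefixed layout from Mark's "e.g." (hash of day number + user id + timestamp), again with assumed names and formats rather than anything from the Cassandra API. A fixed-width hash of the day number spreads each day's rows around the ring under the order-preserving partitioner, while all keys for one day still sort together, so expiring a day is one range scan over that day's prefix.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.concurrent.TimeUnit;

    public class HashedDayKey {
        // Fixed-width hex hash of the day number, so lexicographic key
        // order groups a whole day under one contiguous range.
        static String dayPrefix(long dayNumber) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(Long.toString(dayNumber).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 4; i++) hex.append(String.format("%02x", digest[i] & 0xff));
            return hex.toString();
        }

        static String tweetKey(String userId, long timestampMillis)
                throws NoSuchAlgorithmException {
            long day = TimeUnit.MILLISECONDS.toDays(timestampMillis);
            // Zero-pad the timestamp so string order matches numeric order.
            return dayPrefix(day) + ":" + userId + ":"
                    + String.format("%013d", timestampMillis);
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(30);
            long expiredDay = TimeUnit.MILLISECONDS.toDays(cutoff);
            // Range bounds covering every key written on that day; with OPP,
            // a range scan over [start, end) returns them for deletion.
            String start = dayPrefix(expiredDay) + ":";
            String end = dayPrefix(expiredDay) + ";"; // ';' sorts just after ':'
            System.out.println("expire keys in [" + start + ", " + end + ")");
        }
    }

Scanning a single user's tweets for a given day works the same way, just with the prefix extended to dayPrefix + ":" + userId + ":", which is why 30 days of one user's history costs 30 range scans in Mark's description.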