Re: Mass deletion -- slowing down

2011-11-14 Thread Maxim Potekhin
Thanks for the note. Ideally I would prefer not to keep track of the oldest indexed date, because that means building a bit of infrastructure on top of my database, with attendant referential-integrity problems. But I suppose I'll be forced to do that. In addition, I'll h

Re: Mass deletion -- slowing down

2011-11-14 Thread Guy Incognito
I think what he means is... do you know what day the 'oldest' day is? E.g., if you have a rolling window of, say, 2 weeks, structure your query so that your slice range only goes back 2 weeks rather than to the beginning of time. This would avoid iterating over all the tombstones from prior to the
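
A minimal Pycassa sketch of the bounded slice Guy describes, assuming time-ordered column names and hypothetical keyspace/column-family names:

    from datetime import datetime, timedelta
    import pycassa

    pool = pycassa.ConnectionPool('mykeyspace')      # hypothetical keyspace
    cf = pycassa.ColumnFamily(pool, 'events')        # hypothetical CF with
                                                     # time-ordered column names

    # Start the slice at the two-week horizon instead of the beginning of
    # the row, so the read never visits tombstones older than the window.
    horizon = (datetime.utcnow() - timedelta(days=14)).strftime('%Y%m%d%H%M%S')
    recent = cf.get('some_row_key', column_start=horizon, column_count=1000)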

Re: Mass deletion -- slowing down

2011-11-13 Thread Peter Schuller
> I'm not sure I entirely follow. By the oldest data, do you mean the
> primary key corresponding to the limit of the time horizon? Unfortunately,
> unique IDs and the timestamps do not correlate in the sense that
> chronologically
> "newer" entries might have a smaller sequential ID. That's because

Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin
Thanks Peter, I'm not sure I entirely follow. By the oldest data, do you mean the primary key corresponding to the limit of the time horizon? Unfortunately, unique IDs and the timestamps do not correlate in the sense that chronologically "newer" entries might have a smaller sequential ID. That's

Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin
Brandon, it won't work in my application, as I need a few indexes on attributes of the job. In addition, a large portion of queries is based on key-value lookup, and that key is the unique job ID. I really can't have data packed in one row per day.

Thanks,
Maxim

On 11/13/2011 8:34 PM, Brandon
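
For context, the access pattern Maxim wants to preserve is a plain key-value lookup by job ID; a minimal Pycassa sketch, with hypothetical names:

    import pycassa

    pool = pycassa.ConnectionPool('mykeyspace')      # hypothetical keyspace
    jobs = pycassa.ColumnFamily(pool, 'jobs')        # hypothetical CF

    # Direct lookup by the unique job ID -- the dominant query, which a
    # one-row-per-day packing would make much more awkward.
    job = jobs.get('job_1234567')                    # hypothetical job ID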

Re: Mass deletion -- slowing down

2011-11-13 Thread Peter Schuller
> I do limit the number of rows I'm asking for in Pycassa. Queries on primary
> keys still work fine,

Is it feasible in your situation to keep track of the oldest possible data (for example, if there is a single sequential writer that rotates old entries away it could keep a record of what the old
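
A minimal sketch of the bookkeeping Peter suggests, assuming a hypothetical 'metadata' column family holding a single marker row:

    import pycassa

    pool = pycassa.ConnectionPool('mykeyspace')      # hypothetical keyspace
    meta = pycassa.ColumnFamily(pool, 'metadata')    # hypothetical marker CF

    # After each purge, record the new horizon...
    meta.insert('markers', {'oldest_day': '20111101'})

    # ...and read it back before slicing, so queries never reach back
    # into the tombstone-laden region.
    oldest_day = meta.get('markers')['oldest_day']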

Re: Mass deletion -- slowing down

2011-11-13 Thread Brandon Williams
On Sun, Nov 13, 2011 at 7:25 PM, Maxim Potekhin wrote:
> Each row represents a computational task (a job) executed on the grid or in
> the cloud. It naturally has a timestamp as one of its attributes,
> representing the time of the last update. This timestamp
> is used to group the data into "buck

Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin
Brandon, thanks for the note. Each row represents a computational task (a job) executed on the grid or in the cloud. It naturally has a timestamp as one of its attributes, representing the time of the last update. This timestamp is used to group the data into "buckets" each representing one da
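
A sketch of the layout Maxim describes, one row per job with the day bucket carried in an indexed 'date' column (all names and attributes here are hypothetical):

    import pycassa

    pool = pycassa.ConnectionPool('mykeyspace')      # hypothetical keyspace
    jobs = pycassa.ColumnFamily(pool, 'jobs')        # hypothetical CF

    # One row per job; 'date' holds the day bucket (derived from the last
    # update time) and is backed by a secondary index.
    jobs.insert('job_1234567', {                     # hypothetical job ID
        'date':   '20111113',
        'status': 'finished',
        'owner':  'someuser',                        # illustrative attribute
    })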

Re: Mass deletion -- slowing down

2011-11-13 Thread Brandon Williams
On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin wrote:
> Thanks to all for valuable insight!
>
> Two comments:
> a) this is not actually time series data, but yes, each item has
> a timestamp and thus chronological attribution.
>
> b) so, what do you practically recommend? I need to delete
> half

Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin
Thanks to all for valuable insight!

Two comments:
a) this is not actually time series data, but yes, each item has a timestamp and thus chronological attribution.
b) so, what do you practically recommend? I need to delete half a million to a million entries daily, then insert fresh data. What's
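
One approach often suggested for rolling-window retention is to let Cassandra expire the data itself via column TTLs instead of issuing explicit deletes each day; a hedged Pycassa sketch with hypothetical names (note that expired columns still become tombstones until compaction, but the daily delete pass disappears):

    import pycassa

    pool = pycassa.ConnectionPool('mykeyspace')      # hypothetical keyspace
    jobs = pycassa.ColumnFamily(pool, 'jobs')        # hypothetical CF

    # Write each day's batch with a TTL equal to the retention window;
    # Cassandra then expires the columns without any explicit delete pass.
    RETENTION_SECONDS = 14 * 24 * 3600               # assumed two-week window
    jobs.insert('job_1234567', {'status': 'finished'}, ttl=RETENTION_SECONDS)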

Re: Mass deletion -- slowing down

2011-11-13 Thread Peter Schuller
Deletions in Cassandra imply the use of tombstones (see http://wiki.apache.org/cassandra/DistributedDeletes), and under some circumstances reads can become O(n) in the number of deleted columns. It sounds like this is what you're seeing. For example, suppose you're inserting a
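
A toy illustration of the effect Peter describes, using hypothetical names: after a mass delete, a slice that starts at the head of the row must walk every tombstone before it reaches live data:

    import pycassa

    pool = pycassa.ConnectionPool('mykeyspace')      # hypothetical keyspace
    cf = pycassa.ColumnFamily(pool, 'queue')         # hypothetical queue-like CF

    key = 'work_queue'
    cols = ['%012d' % i for i in range(10000)]

    # Append time-ordered columns, then delete all but the newest.
    with cf.batch(queue_size=500) as b:
        for c in cols:
            b.insert(key, {c: 'payload'})
    cf.remove(key, columns=cols[:-1])

    # Until compaction purges the tombstones, this slice has to skip
    # ~10,000 dead columns before finding the single live one -- a read
    # that is O(n) in the number of deletions.
    cf.get(key, column_count=1)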

Re: Mass deletion -- slowing down

2011-11-13 Thread Brandon Williams
On Sun, Nov 13, 2011 at 5:57 PM, Maxim Potekhin wrote:
> I've done more experimentation and the behavior persists: I start with a
> normal dataset which is searchable by a secondary index. I select by that
> index the entries that match a certain criterion, then delete those. I tried
> two method

Re: Mass deletion -- slowing down

2011-11-13 Thread Maxim Potekhin
I've done more experimentation and the behavior persists: I start with a normal dataset which is searchable by a secondary index. I select by that index the entries that match a certain criterion, then delete those. I tried two methods of deletion -- individual cf.remove() as well as batch rem
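
For reference, the two deletion methods Maxim mentions might look like this in Pycassa (hypothetical keys and names); both produce the same tombstones, which is consistent with the slowdown appearing either way:

    import pycassa

    pool = pycassa.ConnectionPool('mykeyspace')      # hypothetical keyspace
    jobs = pycassa.ColumnFamily(pool, 'jobs')        # hypothetical CF

    doomed = ['job_1', 'job_2', 'job_3']             # keys matched by the index query

    # Method 1: one remove() call (one round trip) per row.
    for key in doomed:
        jobs.remove(key)

    # Method 2: a batch mutator that groups the deletions into fewer requests.
    with jobs.batch(queue_size=100) as b:
        for key in doomed:
            b.remove(key)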

Mass deletion -- slowing down

2011-11-10 Thread Maxim Potekhin
Hello,

My data load comes in batches, each representing one day in the life of a large computing facility. I index the data by the day it was produced, to be able to quickly pull the data for a specific day within the last year or two. There are 6 other indexes. When it comes to retiring the data, I in
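
A sketch of the retirement query this setup implies, selecting a whole day through the secondary index so its rows can be deleted (hypothetical names):

    import pycassa
    from pycassa.index import create_index_clause, create_index_expression

    pool = pycassa.ConnectionPool('mykeyspace')      # hypothetical keyspace
    jobs = pycassa.ColumnFamily(pool, 'jobs')        # hypothetical CF

    # Select every job whose indexed 'date' column matches the day being
    # retired, then delete the matched rows.
    expr = create_index_expression('date', '20101110')
    clause = create_index_clause([expr], count=100000)
    for key, columns in jobs.get_indexed_slices(clause):
        jobs.remove(key)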