Brandon,

Thanks for the note.
Each row represents a computational task (a job) executed on the grid or
in the cloud. It naturally has a timestamp as one of its attributes,
representing the time of the last update. This timestamp
is used to group the data into "buckets", each representing one day of
the system's activity.
I create a "DATE" attribute and add it to each row as a column, e.g.
{'DATE','20111113'}.
I create an index on that column, along with a few others.
Now, I want to rotate the data out of my database on a daily basis. For
that, I need to select on 'DATE' and then do a delete.
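For reference, the rotation I have in mind looks roughly like this in Pycassa (just a sketch; the keyspace and column family names below are placeholders, not our real ones):

```python
# Sketch of the daily rotation: select rows via the secondary index on
# 'DATE', then delete them. 'MyKeyspace' and 'Jobs' are placeholder
# names. Requires a running Cassandra node with Thrift on 9160.
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.index import create_index_expression, create_index_clause

pool = ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
cf = ColumnFamily(pool, 'Jobs')

# Match all rows whose DATE column equals the day being rotated out,
# limiting how many rows come back in one call.
expr = create_index_expression('DATE', '20111113')
clause = create_index_clause([expr], count=1000)

# Delete the matching rows in a batch to cut down on round trips.
with cf.batch() as b:
    for key, _columns in cf.get_indexed_slices(clause):
        b.remove(key)
```

With half a million to a million rows per day, one would repeat this, paging with the clause's start_key, until the index returns nothing.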
I do limit the number of rows I'm asking for in Pycassa. Queries on
primary keys still work fine; it's just the indexed queries that start
to time out. I changed the timeout and the number of retries in the
Pycassa pool, but that doesn't seem to help.
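For what it's worth, the pool settings I tried tweaking were along these lines (a sketch; the keyspace name and the numeric values are illustrative, not the exact ones I used):

```python
# Sketch of tuning the Pycassa connection pool. The timeout and retry
# values are illustrative; 'MyKeyspace' is a placeholder.
from pycassa.pool import ConnectionPool

pool = ConnectionPool(
    'MyKeyspace',
    server_list=['localhost:9160'],
    timeout=30,       # per-connection socket timeout, in seconds
    max_retries=5,    # retries before the pool gives up on an operation
)
```

Even with the client-side timeout raised well past the default, the indexed queries still fail.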
Thanks,
Maxim
On 11/13/2011 8:00 PM, Brandon Williams wrote:
On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin <potek...@bnl.gov> wrote:
Thanks to all for valuable insight!
Two comments:
a) this is not actually time series data, but yes, each item has
a timestamp and thus chronological attribution.
b) so, what do you practically recommend? I need to delete
half a million to a million entries daily, then insert fresh data.
What's the right operation procedure?
I'd have to know more about what your access pattern is like to give
you a fully informed answer.
For some reason I can still select on the index in the CLI; it's
the Pycassa module that gives me trouble. But I need Pycassa, as it
is my platform and we are a Python shop.
This seems odd, since the rpc_timeout is the same for all clients.
Maybe pycassa is asking for more data than the cli?
-Brandon