Brandon,

Thanks for the note.
Each row represents a computational task (a job) executed on the grid or
in the cloud. It naturally has a timestamp as one of its attributes,
representing the time of the last update. This timestamp
is used to group the data into "buckets", each representing one day of
the system's activity.
I create a "DATE" attribute and add it to each row as a column, e.g.
{'DATE','20111113'}.
I create an index on that column, along with a few others.
Now, I want to rotate the data out of my database on a daily basis. For
that, I need to select on 'DATE' and then do a delete.
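For reference, the rotation I have in mind looks roughly like this in Pycassa (just a sketch; the keyspace and column family names below are placeholders, not our real ones):

```python
# Sketch of the daily rotation: select rows via the secondary index on
# 'DATE', then delete them. 'MyKeyspace' and 'Jobs' are placeholder
# names. Requires a running Cassandra node with Thrift on 9160.
from pycassa.pool import ConnectionPool
from pycassa.columnfamily import ColumnFamily
from pycassa.index import create_index_expression, create_index_clause

pool = ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
cf = ColumnFamily(pool, 'Jobs')

# Match all rows whose DATE column equals the day being rotated out,
# limiting how many rows come back in one call.
expr = create_index_expression('DATE', '20111113')
clause = create_index_clause([expr], count=1000)

# Delete the matching rows in a batch to cut down on round trips.
with cf.batch() as b:
    for key, _columns in cf.get_indexed_slices(clause):
        b.remove(key)
```

With half a million to a million rows per day, one would repeat this, paging with the clause's start_key, until the index returns nothing.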
I do limit the number of rows I'm asking for in Pycassa. Queries on
primary keys still work fine; it's just the indexed queries that start
to time out. I changed the timeout and the number of retries in the
Pycassa pool, but that doesn't seem to help.
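For what it's worth, the pool settings I tried tweaking were along these lines (a sketch; the keyspace name and the numeric values are illustrative, not the exact ones I used):

```python
# Sketch of tuning the Pycassa connection pool. The timeout and retry
# values are illustrative; 'MyKeyspace' is a placeholder.
from pycassa.pool import ConnectionPool

pool = ConnectionPool(
    'MyKeyspace',
    server_list=['localhost:9160'],
    timeout=30,       # per-connection socket timeout, in seconds
    max_retries=5,    # retries before the pool gives up on an operation
)
```

Even with the client-side timeout raised well past the default, the indexed queries still fail.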
Thanks,
Maxim
On 11/13/2011 8:00 PM, Brandon Williams wrote:
On Sun, Nov 13, 2011 at 6:55 PM, Maxim Potekhin <potek...@bnl.gov> wrote:
Thanks to all for valuable insight!
Two comments:
a) this is not actually time series data, but yes, each item has
a timestamp and thus chronological attribution.
b) so, what do you practically recommend? I need to delete
half a million to a million entries daily, then insert fresh data.
What's the right operation procedure?
I'd have to know more about what your access pattern is like to give
you a fully informed answer.
For some reason I can still select on the index in the CLI; it's
the Pycassa module that gives me trouble. But I need Pycassa, as it
is my platform and we are a Python shop.
This seems odd, since the rpc_timeout is the same for all clients.
Maybe pycassa is asking for more data than the cli?
-Brandon