You're basically inflicting the worst-case scenario on the Cassandra storage engine: every delete leaves a tombstone behind, and reads have to scan past all of them until gc_grace_seconds expires and compaction removes them. See http://wiki.apache.org/cassandra/DistributedDeletes
You could play around with reducing gc_grace_seconds, but a PQ with "millions" of items is something you should probably just do in memory these days.

On Wed, May 25, 2011 at 10:43 AM, <dnalls...@taz.qinetiq.com> wrote:
> Hi all,
>
> I'm trying to implement a priority queue for holding a large number
> (millions) of items that need to be processed in time order. My solution
> works, but it gets slower and slower until performance is unacceptable,
> even with a small number of items.
>
> Each item essentially needs to be popped off the queue (some arbitrary
> work is then done), and then the item is returned to the queue with a new
> timestamp indicating when it should be processed again. We thus cycle
> through all work items eventually, but some may come around more
> frequently than others.
>
> I am implementing this as a single Cassandra row, in a CF with a TimeUUID
> comparator. Each column name is a TimeUUID, with an arbitrary column value
> describing the work item; the columns are thus sorted in time order.
>
> To pop items, I do a get() such as:
>
>     cf.get(row_key, column_finish=now, column_start=yesterday,
>            column_count=1000)
>
> to get all the items at the head of the queue (if any) whose scheduled
> time has already passed.
>
> For each item retrieved, I do a delete to remove the old column, then an
> insert with a fresh TimeUUID column name (system time + arbitrary
> increment), thus putting the item back somewhere in the queue (currently,
> the back of the queue).
>
> I do a batch_mutate for all these deletes and inserts, with a queue size
> of 2000. These are currently interleaved, i.e.
> delete1-insert1-delete2-insert2...
>
> This all appears to work correctly, but performance starts at around 8000
> cycles/sec, falls to around 1800/sec over the first 250K cycles, and
> continues to fall over time, down to about 150/sec after a few million
> cycles. This happens regardless of the overall size of the row (I have
> tried sizes from 1,000 to 100,000 items). My target performance is 1000
> cycles/sec (but my data store will need to handle other work
> concurrently).
>
> I am currently using just a single node running on localhost, with a
> pycassa client, on a 4-core, 4 GB machine running Fedora 14.
>
> Is this expected behaviour (is there just too much churn for a single row
> to perform well), or am I doing something wrong?
>
> Would https://issues.apache.org/jira/browse/CASSANDRA-2583 in version
> 0.8.1 fix this problem? (I am using version 0.7.6.)
>
> Thanks!
>
> David.

--
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
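To make the thread's two designs concrete, here is a sketch of the pop/requeue cycle described in the quoted message, assuming a pycassa 1.x-style API; the keyspace, column family, and row key names are invented for illustration, as is the 60-second requeue increment:

    import time
    import pycassa
    from pycassa.util import convert_time_to_uuid

    pool = pycassa.ConnectionPool('MyKeyspace')    # keyspace name assumed
    cf = pycassa.ColumnFamily(pool, 'WorkQueue')   # CF with TimeUUID comparator

    row_key = 'queue'
    start = convert_time_to_uuid(time.time() - 86400)  # "yesterday"
    finish = convert_time_to_uuid(time.time())         # "now"

    try:
        # Slice off the head of the queue: every column due up to now.
        due = cf.get(row_key, column_start=start, column_finish=finish,
                     column_count=1000)
    except pycassa.NotFoundException:
        due = {}

    # Interleaved delete/insert per item, flushed in batches of 2000
    # mutations, as described in the post. Every remove() here leaves
    # a tombstone that later slices must scan past.
    b = cf.batch(queue_size=2000)
    for old_col, value in due.items():
        # ... do the arbitrary work for this item here ...
        b.remove(row_key, columns=[old_col])
        new_col = convert_time_to_uuid(time.time() + 60)  # arbitrary increment
        b.insert(row_key, {new_col: value})
    b.send()

And a minimal in-memory equivalent along the lines suggested in the reply, using Python's heapq; the heap keeps the soonest item at the front, the analogue of the TimeUUID column ordering:

    import heapq
    import itertools
    import time

    queue = []
    seq = itertools.count()  # tie-breaker so payloads never get compared

    def push(item, when):
        heapq.heappush(queue, (when, next(seq), item))

    def pop_due(now):
        # In-memory analogue of the column_start/column_finish slice:
        # return every item scheduled at or before `now`.
        due = []
        while queue and queue[0][0] <= now:
            due.append(heapq.heappop(queue)[2])
        return due

    # Seed the queue; payloads here are placeholders.
    for i in range(100000):
        push("item-%d" % i, time.time())

    # One pass of the pop / do-work / requeue cycle.
    for item in pop_due(time.time()):
        # ... do the arbitrary work for this item here ...
        push(item, time.time() + 60.0)

Each heap operation is O(log n) and leaves no tombstones behind, so throughput stays flat no matter how many times the items cycle.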