You're deliberately inflicting the worst-case scenario on the Cassandra
storage engine:
http://wiki.apache.org/cassandra/DistributedDeletes
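
To make the effect concrete, here is a toy model (not Cassandra code, and
every name in it is made up for illustration). It assumes only what the page
above describes: a deleted column is kept as a tombstone rather than removed
at once, so a slice that starts at the head of the row has to step over every
tombstone left by earlier pops before it reaches live columns.

```python
import bisect


class ToyRow:
    """Toy model of one wide row: sorted column names plus tombstones.

    Hypothetical illustration only. It mimics the idea that a deleted
    column lingers as a tombstone and must still be skipped by reads.
    """

    def __init__(self):
        self.names = []    # sorted column names (ints stand in for TimeUUIDs)
        self.dead = set()  # names that are deleted but not yet purged

    def insert(self, name):
        bisect.insort(self.names, name)

    def delete(self, name):
        self.dead.add(name)  # mark only; the entry stays in the row

    def slice_from_head(self, count):
        """Return up to `count` live names and how many entries were scanned."""
        live, scanned = [], 0
        for name in self.names:
            scanned += 1
            if name not in self.dead:
                live.append(name)
                if len(live) == count:
                    break
        return live, scanned


row = ToyRow()
next_name = 0
for _ in range(100):  # 100 live items in the queue
    row.insert(next_name)
    next_name += 1

scans = []
for cycle in range(50):
    head, scanned = row.slice_from_head(10)
    scans.append(scanned)
    for name in head:  # pop each item and re-queue it at the back
        row.delete(name)
        row.insert(next_name)
        next_name += 1
```

The live queue never grows past 100 items, yet `scans` climbs every cycle
because each read walks over all the tombstones in front of it. That matches
the pattern of throughput falling regardless of row size.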

You could experiment with reducing gc_grace_seconds, but a priority queue
with "millions" of items is something you should probably just keep in
memory these days.
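
As a rough sketch of the in-memory approach, assuming a single process and
using only Python's standard heapq module (the WorkQueue class, run_cycle
function, and the fixed delay are hypothetical names chosen for this
example), the pop / work / re-queue cycle could look like this:

```python
import heapq
import itertools


class WorkQueue:
    """In-memory priority queue of work items ordered by due time."""

    def __init__(self):
        self._heap = []
        self._seq = itertools.count()  # tie-breaker for equal due times

    def push(self, due, item):
        heapq.heappush(self._heap, (due, next(self._seq), item))

    def pop_due(self, now, limit=1000):
        """Remove and return up to `limit` items whose due time is <= now."""
        due_items = []
        while self._heap and self._heap[0][0] <= now and len(due_items) < limit:
            _, _, item = heapq.heappop(self._heap)
            due_items.append(item)
        return due_items

    def __len__(self):
        return len(self._heap)


def run_cycle(queue, now, delay):
    """Pop everything that is due, 'process' it, and re-queue it for later."""
    processed = queue.pop_due(now)
    for item in processed:
        # ... arbitrary work on `item` would go here ...
        queue.push(now + delay, item)
    return processed
```

For example, after `q.push(5, "a")`, `q.push(1, "b")` and `q.push(10, "c")`,
calling `run_cycle(q, now=6, delay=100)` processes "b" and then "a" and puts
both back at the end of the queue. Nothing here is durable, so anything that
must survive a restart would have to be persisted separately.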

On Wed, May 25, 2011 at 10:43 AM,  <dnalls...@taz.qinetiq.com> wrote:
>
> Hi all,
>
> I'm trying to implement a priority queue for holding a large number (millions)
> of items that need to be processed in time order. My solution works - but gets
> slower and slower until performance is unacceptable - even with a small number
> of items.
>
> Each item essentially needs to be popped off the queue (some arbitrary work is
> then done) and then the item is returned to the queue with a new timestamp
> indicating when it should be processed again. We thus cycle through all work
> items eventually, but some may come around more frequently than others.
>
> I am implementing this as a single Cassandra row, in a CF with a TimeUUID
> comparator.
>
> Each column name is a TimeUUID, with an arbitrary column value describing the
> work item; the columns are thus sorted in time order.
>
> To pop items, I do a get() such as:
>
>  cf.get(row_key, column_finish=now, column_start=yesterday, column_count=1000)
>
> to get all the items at the head of the queue (if any) whose scheduled time
> is at or before the current system time.
>
> For each item retrieved, I do a delete to remove the old column, then an
> insert with a fresh TimeUUID column name (system time + arbitrary
> increment), thus putting the item back somewhere in the queue (currently,
> the back of the queue).
>
> I do a batch_mutate for all these deletes and inserts, with a queue size of
> 2000. These are currently interleaved i.e. delete1-insert1-delete2-insert2...
>
> This all appears to work correctly, but the performance starts at around 8000
> cycles/sec, falls to around 1800/sec over the first 250K cycles, and continues
> to fall over time, down to about 150/sec, after a few million cycles. This
> happens regardless of the overall size of the row (I have tried sizes from
> 1000 to 100,000 items). My target performance is 1000 cycles/sec (but my
> data store will need to handle other work concurrently).
>
> I am currently using just a single node running on localhost, using a pycassa
> client. 4 core, 4GB machine, Fedora 14.
>
> Is this expected behaviour (is there just too much churn for a single row to
> perform well), or am I doing something wrong?
>
> Would https://issues.apache.org/jira/browse/CASSANDRA-2583 in version 0.8.1
> fix this problem (I am using version 0.7.6)?
>
> Thanks!
>
> David.



-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of DataStax, the source for professional Cassandra support
http://www.datastax.com
