Persistent [priority] queues are better suited to something like HornetQ than Cassandra.
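Jonathan's suggestion further down the thread, that a queue of "millions" of items is probably better kept in memory these days, can be sketched with Python's heapq. This is an illustrative sketch only, not anyone's actual implementation; all names are invented:

```python
import heapq

class RequeueingPQ:
    """In-memory priority queue where popped items are re-inserted
    with a later due time, mirroring the pop/work/requeue cycle
    described in the original question (illustrative sketch)."""

    def __init__(self):
        self._heap = []  # (due_time, item) pairs, ordered by due_time

    def push(self, item, due_time):
        heapq.heappush(self._heap, (due_time, item))

    def pop_due(self, now):
        """Pop the head item if its due time has passed, else return None."""
        if self._heap and self._heap[0][0] <= now:
            return heapq.heappop(self._heap)[1]
        return None

# Cycle an item: pop it when due, do some work, requeue it later.
pq = RequeueingPQ()
pq.push("job-1", due_time=100)
item = pq.pop_due(now=150)        # "job-1" is due, so it comes off
pq.push(item, due_time=150 + 60)  # requeue 60 seconds later
```

There are no tombstones or compaction effects here, so throughput stays flat regardless of churn; durability, of course, has to come from somewhere else (e.g. a write-ahead log or HornetQ itself).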
On Wed, May 25, 2011 at 9:10 PM, Dan Kuebrich <dan.kuebr...@gmail.com> wrote:

> It sounds like the problem is that the row is getting filled up with tombstones and becoming enormous? Another idea then, which might not be worth the added complexity, is to progressively use new rows. Depending on volume, this could mean having 5-minute-window rows, or 1-minute, or whatever works best.
>
> Read: assuming you're not falling behind, you only need to query the row that the current time falls in and the one immediately prior. If you do fall behind, you'll have to walk backwards through the buckets until you find them empty.
> Write: write the column to the bucket (row) that corresponds to the correct time window.
> Delete: delete the column from the row it was read from. When all columns in a row are deleted, the row can be GC'd.
>
> Again, Cassandra might not be the correct datastore.
>
> On Wed, May 25, 2011 at 3:56 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> You're basically intentionally inflicting the worst-case scenario on the Cassandra storage engine: http://wiki.apache.org/cassandra/DistributedDeletes
>>
>> You could play around with reducing gc_grace_seconds, but a PQ with "millions" of items is something you should probably just do in memory these days.
>>
>> On Wed, May 25, 2011 at 10:43 AM, <dnalls...@taz.qinetiq.com> wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to implement a priority queue holding a large number (millions) of items that need to be processed in time order. My solution works, but it gets slower and slower until performance is unacceptable, even with a small number of items.
>>>
>>> Each item essentially needs to be popped off the queue (some arbitrary work is then done), and then the item is returned to the queue with a new timestamp indicating when it should be processed again.
>>> We thus cycle through all work items eventually, but some may come around more frequently than others.
>>>
>>> I am implementing this as a single Cassandra row, in a CF with a TimeUUID comparator. Each column name is a TimeUUID, with an arbitrary column value describing the work item; the columns are thus sorted in time order.
>>>
>>> To pop items, I do a get() such as:
>>>
>>> cf.get(row_key, column_finish=now, column_start=yesterday, column_count=1000)
>>>
>>> to get all the items at the head of the queue (if any) whose scheduled time has passed (i.e. is not later than the current system time).
>>>
>>> For each item retrieved, I do a delete to remove the old column, then an insert with a fresh TimeUUID column name (system time + an arbitrary increment), thus putting the item back somewhere in the queue (currently, the back of the queue).
>>>
>>> I do a batch_mutate for all these deletes and inserts, with a batch size of 2000. These are currently interleaved, i.e. delete1-insert1-delete2-insert2...
>>>
>>> This all appears to work correctly, but performance starts at around 8000 cycles/sec, falls to around 1800/sec over the first 250K cycles, and continues to fall over time, down to about 150/sec after a few million cycles. This happens regardless of the overall size of the row (I have tried sizes from 1,000 to 100,000 items). My target performance is 1000 cycles/sec (but my data store will need to handle other work concurrently).
>>>
>>> I am currently using just a single node running on localhost, via a pycassa client, on a 4-core, 4 GB machine running Fedora 14.
>>>
>>> Is this expected behaviour (is there just too much churn for a single row to perform well), or am I doing something wrong?
>>>
>>> Would https://issues.apache.org/jira/browse/CASSANDRA-2583 in version 0.8.1 fix this problem (I am using version 0.7.6)?
>>>
>>> Thanks!
>>>
>>> David.
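Dan's bucketing suggestion above reduces to simple key arithmetic: map each item's due timestamp to a fixed-window row key, and on a pop scan only the current bucket plus the one(s) immediately prior. The sketch below shows just that arithmetic; the actual pycassa reads/writes are omitted, and the window size and key format are illustrative assumptions, not anything from the thread:

```python
WINDOW = 300  # 5-minute buckets, per the suggestion; tune to volume


def bucket_key(ts, window=WINDOW):
    """Row key for the time-window bucket containing timestamp ts
    (epoch seconds). All items due in the same window share a row."""
    return "queue-%d" % (int(ts) // window)


def buckets_to_read(now, window=WINDOW, lag_buckets=1):
    """Row keys to scan on a pop: the current window plus the
    immediately prior one(s). If the consumer has fallen behind,
    increase lag_buckets to walk further back until buckets are empty."""
    base = int(now) // window
    return ["queue-%d" % (base - i) for i in range(lag_buckets + 1)]


# Write: insert the column into bucket_key(item_due_time).
# Read:  get() each row in buckets_to_read(now) with column_finish=now.
# Delete: remove the column from the row it was read from; once every
# column in a row is deleted, the whole row can eventually be GC'd,
# so tombstones never accumulate in one ever-growing row.
```

The point of the scheme is that each row's tombstones are bounded by one window's worth of traffic, so reads never have to skip past millions of deleted columns.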
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com

--
---------------------------------------------
Paul Loy
p...@keteracel.com
http://uk.linkedin.com/in/paulloy