Persistent [priority] queues are better suited to something like HornetQ
than Cassandra.

On Wed, May 25, 2011 at 9:10 PM, Dan Kuebrich <dan.kuebr...@gmail.com> wrote:

> It sounds like the problem is that the row is getting filled up with
> tombstones and becoming enormous?  Another idea then, which might not be
> worth the added complexity, is to progressively use new rows.  Depending on
> volume, this could mean having 5-minute-window rows, or 1 minute, or
> whatever works best.
>
> Read: Assuming you're not falling behind, you only need to query the row
> that the current time falls in and the one immediately prior.  If you do
> fall behind, you'll have to walk backwards in buckets until you find them
> empty.
> Write: Write column to the bucket (row) that corresponds to the correct
> time window.
> Delete: Delete the column from the row it was read from.  When all columns
> in the row are deleted the row can GC.
>
> Again, Cassandra might not be the correct datastore.
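[For reference, the per-window bucketing described above might look roughly like this in Python; the 5-minute width, the "queue-" key format, and the function names are all illustrative, not from the thread.]

```python
BUCKET_SECONDS = 300  # hypothetical 5-minute windows

def bucket_key(ts, width=BUCKET_SECONDS):
    """Row key for the time window containing ts (epoch seconds)."""
    return "queue-%d" % (int(ts) // width * width)

def buckets_to_read(now, width=BUCKET_SECONDS):
    """If you're not falling behind, only the current window and the
    one immediately prior need to be queried."""
    return [bucket_key(now - width, width), bucket_key(now, width)]
```

Writes go to bucket_key(item_due_time); reads scan buckets_to_read(time.time()) and walk further back only if processing has fallen behind.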
>
> On Wed, May 25, 2011 at 3:56 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> You're basically intentionally inflicting the worst case scenario on
>> the Cassandra storage engine:
>> http://wiki.apache.org/cassandra/DistributedDeletes
>>
>> You could play around with reducing gc_grace_seconds but a PQ with
>> "millions" of items is something you should probably just do in memory
>> these days.
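[Jonathan's in-memory alternative can be sketched with the stdlib heapq module; the reschedule increment and item names are made up for illustration.]

```python
import heapq

def cycle(pq, now, increment=60.0):
    """Pop every item due at or before `now`, 'process' it, and push it
    back with a fresh timestamp (now + increment), mirroring the
    pop-and-reschedule cycle described below."""
    processed = []
    while pq and pq[0][0] <= now:
        due, item = heapq.heappop(pq)
        processed.append(item)                  # arbitrary work goes here
        heapq.heappush(pq, (now + increment, item))
    return processed

# entries are (next_run_time, item) pairs; heapq keeps the earliest first
pq = [(1.0, "a"), (2.0, "b"), (10.0, "c")]
heapq.heapify(pq)
print(cycle(pq, 5.0))   # "a" and "b" are due, "c" is not
```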
>>
>> On Wed, May 25, 2011 at 10:43 AM,  <dnalls...@taz.qinetiq.com> wrote:
>> >
>> >
>> > Hi all,
>> >
>> > I'm trying to implement a priority queue for holding a large number
>> > (millions) of items that need to be processed in time order. My solution
>> > works - but gets slower and slower until performance is unacceptable -
>> > even with a small number of items.
>> >
>> > Each item essentially needs to be popped off the queue (some arbitrary
>> > work is then done) and then the item is returned to the queue with a new
>> > timestamp indicating when it should be processed again. We thus cycle
>> > through all work items eventually, but some may come around more
>> > frequently than others.
>> >
>> > I am implementing this as a single Cassandra row, in a CF with a
>> > TimeUUID comparator.
>> >
>> > Each column name is a TimeUUID, with an arbitrary column value
>> > describing the work item; the columns are thus sorted in time order.
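[The time ordering comes from the timestamp embedded in every version-1 UUID; a minimal illustration with Python's stdlib uuid module, independent of Cassandra:]

```python
import time
import uuid

# uuid1() embeds a 100-ns-resolution timestamp; Cassandra's TimeUUID
# comparator orders columns by that timestamp, which is why the columns
# come back sorted in time order.
earlier = uuid.uuid1()
time.sleep(0.01)
later = uuid.uuid1()
assert earlier.time < later.time   # the later UUID sorts after the earlier one
```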
>> >
>> > To pop items, I do a get() such as:
>> >
>> >  cf.get(row_key, column_finish=now, column_start=yesterday,
>> >         column_count=1000)
>> >
>> > to get all the items at the head of the queue (if any) whose time has
>> > been reached or passed by the current system time.
>> >
>> > For each item retrieved, I do a delete to remove the old column, then
>> > an insert with a fresh TimeUUID column name (system time + arbitrary
>> > increment), thus putting the item back somewhere in the queue
>> > (currently, the back of the queue).
>> >
>> > I do a batch_mutate for all these deletes and inserts, with a queue
>> > size of 2000. These are currently interleaved i.e.
>> > delete1-insert1-delete2-insert2...
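[The interleaved mutation list can be sketched as below; `new_column_name` stands in for a hypothetical helper that mints the fresh TimeUUID, and the tuple encoding is illustrative rather than pycassa's actual batch format.]

```python
def cycle_mutations(row_key, popped, new_column_name):
    """Build the interleaved delete1-insert1-delete2-insert2... list for
    one pop-and-reschedule pass. `popped` is [(old_column, value), ...]."""
    muts = []
    for old_col, value in popped:
        muts.append(("delete", row_key, old_col))
        muts.append(("insert", row_key, new_column_name(value), value))
    return muts
```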
>> >
>> > This all appears to work correctly, but the performance starts at
>> > around 8000 cycles/sec, falls to around 1800/sec over the first 250K
>> > cycles, and continues to fall over time, down to about 150/sec, after a
>> > few million cycles. This happens regardless of the overall size of the
>> > row (I have tried sizes from 1000 to 100,000 items). My target
>> > performance is 1000 cycles/sec (but my data store will need to handle
>> > other work concurrently).
>> >
>> > I am currently using just a single node running on localhost, using a
>> > pycassa client. 4 core, 4GB machine, Fedora 14.
>> >
>> > Is this expected behaviour (is there just too much churn for a single
>> > row to perform well), or am I doing something wrong?
>> >
>> > Would https://issues.apache.org/jira/browse/CASSANDRA-2583 in version
>> > 0.8.1 fix this problem (I am using version 0.7.6)?
>> >
>> > Thanks!
>> >
>> > David.
>> >
>>
>>
>>
>> --
>> Jonathan Ellis
>> Project Chair, Apache Cassandra
>> co-founder of DataStax, the source for professional Cassandra support
>> http://www.datastax.com
>>
>
>


-- 
---------------------------------------------
Paul Loy
p...@keteracel.com
http://uk.linkedin.com/in/paulloy
