Hi all,
Has anyone written a work-queue implementation using Cassandra?
There's a section on the UseCases wiki page titled "A distributed
Priority Job Queue" which looks perfect, but unfortunately it hasn't
been filled in yet.
http://wiki.apache.org/cassandra/UseCases#A_distributed_Priority_Job_Queue
I've been thinking about how best to do this, but every solution I've
thought of seems to have some serious drawback. The "range ghost"
problem in particular creates some issues. I'm assuming each job has
a row within some column family, where the row's key is the time at
which the job should be run. To find the next job, you'd do a range
query with a start a few hours in the past, and an end at the current
time. Once a job is completed, you delete the row.
The problem here is that every query has to scan through all of the
deleted-but-not-yet-GCed rows in that window before it reaches a live
job. Is there a better way?
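
To make the cost concrete, here is a rough in-memory sketch in Python.
It is not Cassandra client code: GhostStore and its methods are names
I made up for illustration, and a deleted row is simply modelled as a
key that stays in the range until it is garbage collected.

```python
import bisect


class GhostStore:
    """Toy in-memory stand-in for one column family (hypothetical, not a Cassandra API)."""

    def __init__(self):
        self.keys = []  # sorted row keys (the time each job should run)
        self.rows = {}  # key -> job payload, or None for a deleted-but-not-GCed row

    def insert(self, run_at, job):
        if run_at not in self.rows:
            bisect.insort(self.keys, run_at)
        self.rows[run_at] = job

    def delete(self, run_at):
        # The key stays in the range; only the payload goes away.
        if run_at in self.rows:
            self.rows[run_at] = None

    def next_job(self, start, end):
        """Return (key, job, rows_scanned) for the first live row in [start, end]."""
        scanned = 0
        i = bisect.bisect_left(self.keys, start)
        while i < len(self.keys) and self.keys[i] <= end:
            scanned += 1
            key = self.keys[i]
            if self.rows[key] is not None:
                return key, self.rows[key], scanned
            i += 1
        return None, None, scanned


store = GhostStore()
for t in range(100):
    store.insert(t, f"job-{t}")
for t in range(99):
    store.delete(t)  # 99 jobs completed, none GCed yet

key, job, scanned = store.next_job(0, 99)
```

In this model the query walks past all 99 deleted rows before it
reaches the one live job, so the work per query grows with the number
of recently completed jobs rather than the number of pending ones.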
Preventing more than one worker from starting the same job seems like
it would be a problem too. You'd either need an external locking
manager, or have to use some other protocol where workers write their
ID into the row and then immediately read it back to confirm that they
are the owner of the job.
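
Here is a rough sketch of that write-then-read-back idea, again as a
toy in-memory model in Python rather than real Cassandra code. JobRow,
the single "owner" column, and last-write-wins behaviour are all
assumptions made for illustration.

```python
class JobRow:
    """Toy job row with one hypothetical 'owner' column; last write wins."""

    def __init__(self):
        self.owner = None

    def write_owner(self, worker_id):
        self.owner = worker_id

    def read_owner(self):
        return self.owner


def try_claim(row, worker_id):
    """Write our ID, then read it back; claim the job only if our ID is still there."""
    row.write_owner(worker_id)
    return row.read_owner() == worker_id


# Case 1: both workers write before either reads back.
row = JobRow()
row.write_owner("worker-a")
row.write_owner("worker-b")
a_owns = row.read_owner() == "worker-a"
b_owns = row.read_owner() == "worker-b"

# Case 2: worker-a finishes its write and read-back before worker-b writes.
row2 = JobRow()
both_claimed = (try_claim(row2, "worker-a"), try_claim(row2, "worker-b"))
```

In case 1 only worker-b sees its own ID, which is what we want. In
case 2 both calls return True in this model, so the read-back on its
own narrows the window for a double start but does not close it.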
Any ideas here? Has anyone come up with a nice implementation? Is
Cassandra not well suited for queue-like tasks?
Thanks,
Andrew