Reddit posted a blog entry about some recent downtime, which was partly due to issues with Cassandra: http://blog.reddit.com/2010/05/reddits-may-2010-state-of-servers.html
This part surprised me:

"First, Cassandra has an internal queue of work to do. When it times out a client (10s by default), it still leaves the operation in the queue of work to complete (even though the person that asked for the read is no longer even holding the socket), which given a constant stream of requests makes the amount of pending work snowball effectively infinitely (specifically, ROW-READ-STAGE's PENDING operations grow unbounded)."

I've searched Jira for an issue related to this without finding one. It seems like a bug to keep reads in the queue when the result is useless because the reader is gone. Obviously a 10-second read is not a normal operating condition, but dropping stale reads could remove one cause of cascading failure.

Should I open a ticket, or have I misunderstood something?
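
For what it's worth, here is a rough sketch of what I mean by removing stale reads. It is Python rather than Cassandra's actual Java, and ReadTask, drain, and RPC_TIMEOUT_S are names I made up for illustration; none of them come from the Cassandra codebase. The only idea it shows is that the worker checks a task's age when it dequeues it and skips the work if the client must already have timed out.

```python
from collections import deque
from dataclasses import dataclass

# Assumption: mirrors the 10s client timeout mentioned in the quoted blog post.
RPC_TIMEOUT_S = 10.0


@dataclass
class ReadTask:
    """Hypothetical queued read: the key to read and when it was enqueued."""
    key: str
    enqueued_at: float  # seconds on some monotonic clock


def drain(queue, now, timeout_s=RPC_TIMEOUT_S):
    """Process every queued read, skipping those older than the timeout.

    Returns (executed_keys, dropped_count). A real stage would perform the
    read where this sketch just records the key.
    """
    executed = []
    dropped = 0
    while queue:
        task = queue.popleft()
        if now - task.enqueued_at > timeout_s:
            # The client has already been timed out, so the result is useless.
            dropped += 1
            continue
        executed.append(task.key)
    return executed, dropped


# Three reads queued at t=0, t=5 and t=11; the worker gets to them at t=12.
pending = deque([ReadTask("a", 0.0), ReadTask("b", 5.0), ReadTask("c", 11.0)])
executed, dropped = drain(pending, now=12.0)
```

With a check like this, a backlog of expired requests costs only a dequeue and a comparison each, instead of a full read per request, so the pending count can shrink again once the burst passes.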