> Keyspace: TimeFrameClicks
>         Read Count: 42686
>         Read Latency: 47.21777100220213 ms.
>         Write Count: 18398
>         Write Latency: 0.17457457332318732 ms.

Is this all the traffic across the "few days" you mentioned the node
had been running?

> Based on feedback from this list by jbellis, I'm hitting Cassandra too hard.
> So I removed the offending server from the LB. Waited about 20 mins and the
> pending queue did not clear at all.
>
> After killing Cassandra and restarting it, this box recovered.

If those 42k reads represent all reads done over a several-day period,
it seems unlikely that you're really hitting it too hard. Or are
individual reads really expensive? (If you already posted that info,
sorry)
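
(Back of the envelope: if the node really has been up for, say, three
days, 42686 reads works out to 42686 / 259200 s ≈ 0.16 reads/second on
average. Even allowing for heavy bursts, that is not much load.)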

Also: if you're hitting it too hard such that there is a cumulative
effect over time until you notice it, I'd expect latencies to become
worse and worse over time (and obviously so); you wouldn't be fine all
along and then suddenly freeze.

How much do you know about the individual request success rate? Is it
possible that some subset of requests never finishes at all (does your
application log timeouts?) while others proceed without difficulty?
That might lead to an overall build-up of pending jobs even though
most requests make it through. Then, once you eventually hit a
queue-size or concurrency limit, you may experience a "sudden" hang.
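
If the application doesn't log timeouts today, a wrapper along these
lines at least makes hung requests visible (just a sketch; doRead() is
a hypothetical stand-in for whatever your actual read path is):

    import java.util.concurrent.*;

    public class TimedRead {
        private static final ExecutorService pool =
            Executors.newCachedThreadPool();

        // Hypothetical stand-in for your real Cassandra read call.
        static String doRead(String key) { return null; }

        static String readWithTimeout(final String key, long timeoutMs)
                throws Exception {
            Future<String> f = pool.submit(new Callable<String>() {
                public String call() throws Exception {
                    return doRead(key);
                }
            });
            try {
                return f.get(timeoutMs, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                f.cancel(true);
                // Count/log these; a steadily growing number of timeouts,
                // possibly for particular keys, would match the theory above.
                System.err.println("read timed out for key " + key);
                throw e;
            }
        }
    }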

I'm still wondering what the threads are actually doing in the ROW-READ-STAGE.

Are you blocking forever trying to read from disk, for example? Such
things can happen with hardware or OS issues; some particular part of
a file may be unreadable. In such a case you'd expect queries for
certain columns or ranges of columns to hang.
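
As a crude sanity check (a sketch only; run it against an SSTable data
file, ideally with the node down, and note that it only proves the
file is sequentially readable, not that mmap'd access behaves), you
can read a file end to end and see whether any region errors or stalls:

    import java.io.*;

    public class ReadCheck {
        public static void main(String[] args) throws IOException {
            // Usage: java ReadCheck /path/to/data/file
            InputStream in =
                new BufferedInputStream(new FileInputStream(args[0]));
            byte[] buf = new byte[1 << 20];
            long pos = 0;
            try {
                int n;
                while ((n = in.read(buf)) != -1)
                    pos += n;
                System.out.println("read " + pos + " bytes without error");
            } catch (IOException e) {
                // An I/O error here points at the offset of the bad region.
                System.err.println("failed near byte " + pos + ": " + e);
            } finally {
                in.close();
            }
        }
    }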

I think it might be useful to observe the stage stats immediately on
restart. Keep an eye on how many concurrent requests there are in the
ROW-READ-STAGE. Based on the numbers you've posted I'd suspect that
the concurrency will usually be 0. In any case you should see it
fluctuate, and hopefully it will be visible when the fluctuations
suddenly shift up by one thread (so you go from fluctuating between
e.g. 0-3 threads to between 1-4 threads, never going below 1). That
would indicate you have a hung thread, and it may be worth attaching
with jconsole to see what that thread is blocking on, before so much
time has passed that you have thousands of pending requests waiting.
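
Something along these lines can do the watching for you (a sketch; it
assumes JMX on localhost:8080 as on the 0.6 line, and that the stage
MBeans live under org.apache.cassandra.concurrent with ActiveCount and
PendingTasks attributes -- adjust for your version):

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.*;

    public class RowReadStageWatcher {
        public static void main(String[] args) throws Exception {
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
            MBeanServerConnection mbs =
                JMXConnectorFactory.connect(url).getMBeanServerConnection();
            ObjectName stage = new ObjectName(
                "org.apache.cassandra.concurrent:type=ROW-READ-STAGE");
            while (true) {
                // Poll once a second; watch for the active-count floor
                // creeping up from 0 to 1, 2, ... (hung threads).
                System.out.println(System.currentTimeMillis()
                    + " active=" + mbs.getAttribute(stage, "ActiveCount")
                    + " pending=" + mbs.getAttribute(stage, "PendingTasks"));
                Thread.sleep(1000);
            }
        }
    }

Run it against a freshly restarted node; once the floor creeps up,
jconsole (or jstack against the Cassandra pid) should show what the
stuck thread is blocked on.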

If it *is* a matter of parts of files being unreadable it may be
easier to spot using standard I/O rather than mmap():ed I/O since you
should clearly see it blocking on a read in that case.
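
(If I recall correctly that's the DiskAccessMode setting in
storage-conf.xml -- "standard" rather than "auto"/"mmap" -- or
disk_access_mode in cassandra.yaml on more recent versions;
double-check for your version.)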

-- 
/ Peter Schuller
