> Keyspace: TimeFrameClicks
> Read Count: 42686
> Read Latency: 47.21777100220213 ms.
> Write Count: 18398
> Write Latency: 0.17457457332318732 ms.
Is this all traffic across "a few days" as you mentioned the node had been running?

> Based on feedback from this list by jbellis, I'm hitting cassandra too
> hard. So I removed the offending server from the LB. Waited about 20
> mins and the pending queue did not clear at all.
>
> Killing Cassandra and restarting it, this box recovered.

If those 42k reads represent all reads done over a several-day period, it seems unlikely that you're really hitting it too hard. Or are individual reads really expensive? (If you already posted that info, sorry.)

Also: if you were hitting it hard enough to have a cumulative effect over time until you noticed it, I'd expect latencies to become worse and worse over time (and obviously so); you wouldn't be fine all along and then suddenly freeze.

How much do you know about individual request success rates? Is it possible that some subset of requests never finishes at all (does your application log timeouts?), while the others proceed without difficulty? That could lead to a gradual build-up of pending jobs even though most requests make it through. Then, once you eventually hit a queue size or concurrency limit, you would see a "sudden" hang.

I'm still wondering what the threads in the ROW-READ-STAGE are actually doing. Are you blocking forever trying to read from disk, for example? Such things can happen with hardware or OS issues; some particular part of a file may be unreadable. In that case you'd expect queries for certain columns or ranges of columns to hang.

I think it would be useful to observe the stage stats immediately after a restart. Keep an eye on how many concurrent requests there are in the ROW-READ-STAGE. Based on the numbers you've posted, I'd expect the concurrency to usually be 0. In any case you should see it fluctuate, and hopefully it will be visible when the fluctuations suddenly shift up by one thread (so you go from fluctuating between e.g. 0-3 threads to between 1-4 threads, never going below 1). That would indicate a hung thread, and it may be worth attaching with jconsole to see what that thread is blocking on, before so much time has passed that you have thousands of threads waiting.

If it *is* a matter of parts of files being unreadable, it may be easier to spot with standard I/O than with mmap():ed I/O, since you should clearly see it blocking on a read in that case.

-- 
/ Peter Schuller
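To make the stage-watching suggestion concrete, here is a minimal Java sketch that polls the read stage counters over JMX once a second. The endpoint (localhost:8080), the MBean name (org.apache.cassandra.concurrent:type=ROW-READ-STAGE) and the attribute names (ActiveCount, PendingTasks) are assumptions based on the old defaults; check the MBeans tab in jconsole and adjust them to whatever your build actually exposes.

import javax.management.MBeanServerConnection;
import javax.management.ObjectName;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class RowReadStageWatcher {
    public static void main(String[] args) throws Exception {
        // Assumed JMX endpoint (old default port 8080); adjust to your
        // cassandra.in.sh / JVM_OPTS settings.
        JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi");
        JMXConnector connector = JMXConnectorFactory.connect(url);
        MBeanServerConnection mbs = connector.getMBeanServerConnection();

        // Assumed MBean name for the read stage; verify it in jconsole first.
        ObjectName rowReadStage = new ObjectName(
                "org.apache.cassandra.concurrent:type=ROW-READ-STAGE");

        // Poll once a second. A hung thread should show up as the active count
        // fluctuating e.g. between 1 and 4 instead of 0 and 3, never reaching 0.
        while (true) {
            Object active = mbs.getAttribute(rowReadStage, "ActiveCount");
            Object pending = mbs.getAttribute(rowReadStage, "PendingTasks");
            System.out.println("active=" + active + " pending=" + pending);
            Thread.sleep(1000);
        }
    }
}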
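Similarly, the jconsole step can be scripted if the node tends to lock up when nobody is watching. The sketch below attaches to the same (assumed) JMX endpoint and prints the stack of every thread whose name contains ROW-READ-STAGE; the thread naming is also an assumption, so verify it against the thread list jconsole shows. A thread genuinely stuck on disk should show the same read- or map-related frame every time you run it.

import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import javax.management.MBeanServerConnection;
import javax.management.remote.JMXConnector;
import javax.management.remote.JMXConnectorFactory;
import javax.management.remote.JMXServiceURL;

public class StuckReadThreadDump {
    public static void main(String[] args) throws Exception {
        // Same assumed JMX endpoint as in the previous sketch.
        JMXConnector connector = JMXConnectorFactory.connect(new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://localhost:8080/jmxrmi"));
        MBeanServerConnection mbs = connector.getMBeanServerConnection();
        ThreadMXBean threads = ManagementFactory.newPlatformMXBeanProxy(
                mbs, ManagementFactory.THREAD_MXBEAN_NAME, ThreadMXBean.class);

        // Dump all threads and keep only the read-stage ones.
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            if (info.getThreadName().contains("ROW-READ-STAGE")) {
                System.out.println(info.getThreadName()
                        + " (" + info.getThreadState() + ")");
                for (StackTraceElement frame : info.getStackTrace()) {
                    System.out.println("    at " + frame);
                }
            }
        }
        connector.close();
    }
}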