On Fri, Jul 30, 2010 at 1:05 AM, Peter Schuller <peter.schul...@infidyne.com> wrote:
>> Keyspace: TimeFrameClicks
>>         Read Count: 42686
>>         Read Latency: 47.21777100220213 ms.
>>         Write Count: 18398
>>         Write Latency: 0.17457457332318732 ms.
>
> Is this all traffic across "a few days" as you mentioned the node had
> been running?

Sorry, that is a cut-and-paste from after the restart; the pattern remains
the same. I just did a full cluster restart to stop the Transport and
Timeout exceptions, and the app recovers. When it happens again I'll send
you the info from prior to the restart.

> Also: If you're hitting it too hard such that you have a cumulative
> effect over time until you notice it, I'd expect latencies to become
> worse and worse over time (and obviously so); you wouldn't be all fine
> and then suddenly freeze.

Latency is fine; basically the service suddenly freezes. On top of that,
to reduce the number of reads I have memcached fronting this at a 92% hit
rate.

> How much do you know about individual request success rate? Is it
> possible that some subset of requests never finish at all (does your
> application log timeouts?), while others proceed without difficulty?
> That might lead to an overall build-up of pending jobs even though
> most requests make it through. Then once you eventually hit a queue
> size limit/concurrency limit you may experience a "sudden" hang.

I have very detailed stats on the exceptions thrown from the Cassandra
client. For about 3-5 days I have a 99% success ratio for connecting and
pulling a single hash key with a single column, i.e.
{$sn}_{$userid}_{$click} => {$when}; then I have about a 25-40% failure
rate when the hang occurs.

> I'm still wondering what the threads are actually doing in the
> ROW-READ-STAGE.
>
> Are you blocking forever trying to read from disk for example? Such
> things can happen with hardware issues or OS issues; some particular
> part of a file may be unreadable. In such a case you'd expect queries
> for certain columns or ranges of columns to hang.
> I think it might be useful to observe the stage stats immediately on
> restart. Keep an eye out for how many concurrent requests there are in
> the ROW-READ-STAGE. Based on the numbers you've posted I'd suspect
> that the concurrency will usually be 0. In any case you should see it
> fluctuate, and hopefully it will be visible when the fluctuations
> suddenly displace by one thread (so you go from fluctuating between
> e.g. 0-3 threads to between 1-4 threads, never going below 1). This
> should indicate you have a hung thread, and it may be worth attaching
> with jconsole to see what that thread is blocking on, before so much
> time has passed that you have thousands of threads waiting.
>
> If it *is* a matter of parts of files being unreadable it may be
> easier to spot using standard I/O rather than mmap():ed I/O since you
> should clearly see it blocking on a read in that case.

Interesting, I will try this. Thanks Peter!

> --
> / Peter Schuller
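Peter's suggestion of watching the stage stats can be sketched as a small polling loop. This is only a sketch: the `nodetool` flags and the tpstats column layout (stage name, active, pending, completed) are assumptions that should be checked against the output of your Cassandra version.

```shell
#!/bin/sh
# Poll the ROW-READ-STAGE once a minute and log its active/pending counts.
# Assumes `nodetool` is on the PATH and that tpstats columns are:
# stage-name, active, pending, completed (verify for your version).
while true; do
  ts=$(date +%T)
  nodetool -h localhost tpstats |
    awk -v ts="$ts" '$1 == "ROW-READ-STAGE" { print ts, "active=" $2, "pending=" $3 }'
  sleep 60
done
```

If the logged "active" floor rises by one (e.g. it fluctuates between 1-4 instead of 0-3 and never drops back), attach jconsole as suggested, or take a thread dump with `jstack <cassandra-pid>` and look at what the ROW-READ-STAGE threads are blocked on.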