On Fri, Jul 30, 2010 at 1:05 AM, Peter Schuller <peter.schul...@infidyne.com
> wrote:

> > Keyspace: TimeFrameClicks
> >         Read Count: 42686
> >         Read Latency: 47.21777100220213 ms.
> >         Write Count: 18398
> >         Write Latency: 0.17457457332318732 ms.
>
> Is this all traffic across "a few days" as you mentioned the node had
> been running?
>

Sorry, that is a cut and paste from after the restart. The pattern remains the
same. I just did a full cluster restart to stop the Transport and Timeout
exceptions, after which the app recovers.
When it happens again I'll send you the info from before the restart.



> Also: If you're hitting it too hard such that you have a cumulative
> effect over time until you notice it, I'd expect latencies to become
> worse and worse over time (and obviously so); you wouldn't be all fine
> and then suddenly freeze.
>

Latency is fine; basically the service suddenly freezes. On top of that, to
reduce the number of reads I have memcache fronting this at a 92% hit rate.



> How much do you know about individual request success rate? is it
> possible that some subset of requests never finish at all (does your
> application log timeouts?), while others proceed without difficulty.
> That might lead to an overall build-up of pending jobs even though
> most requests make it through. Then once you eventually hit a queue
> size limit/concurrency limit you may experience a "sudden" hang.
>
>
I have very detailed stats on the exceptions thrown from the cassandra
client. For about 3-5 days I have a 99% success ratio (connection plus
service) for pulling a single hash key with a single column.

i.e. {$sn}_{$userid}_{$click} => {$when}

then I have about a 25-40% failure rate when the hang occurs.
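
To make the build-up idea concrete, here is a minimal sketch of how a small
fraction of never-finishing requests could look fine for a long time and then
hang all at once. It is a toy model, not Cassandra code: the pool size, the
hang rate, and the function name are all made-up values for illustration.

```python
def requests_until_hang(pool_size, hang_every):
    """Count requests handled before every worker in the pool is stuck.

    Hypothetical model: every `hang_every`-th request never returns and
    permanently occupies one worker; all other requests finish at once.
    Returns (requests served successfully, total requests issued).
    """
    stuck = 0        # workers permanently blocked on a hung request
    served = 0       # requests that completed normally
    request_no = 0   # total requests issued so far
    while stuck < pool_size:
        request_no += 1
        if request_no % hang_every == 0:
            stuck += 1    # this worker never comes back
        else:
            served += 1   # normal request, worker is released
    return served, request_no


# Illustrative numbers only: 8 read threads, 1 request in 10,000 hangs.
served, total = requests_until_hang(pool_size=8, hang_every=10_000)
print(f"{served} of {total} requests succeeded before the pool was exhausted")
```

In this model the success ratio stays above 99.9% right up to the point where
the last free worker is taken, which matches a "fine for days, then frozen"
pattern better than a gradual latency increase would.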



> I'm still wondering what the threads are actually doing in the
> ROW-READ-STAGE.
>
> Are you blocking forever trying to read from disk for example? Such
> things can happen with hardware issues or OS issues; some particular
> part of a file may be unreadable. In such a case you'd expect queries
> for certain columns or ranges of column to hang.
>
> I think it might be useful to observe the stages stats immediately on
> restart. Keep an eye out for how many concurrent requests there are in
> the ROW-READ-STAGE. Based on the numbers you've posted I'd suspect
> that the concurrency will usually be 0. In any case you should see it
> fluctuate, and hopefully it will be visible when the fluctuations
> suddenly displace by one thread (so you go from fluctuating between
> e.g. 0-3 threads to between 1-4 threads never going below 1). This
> should indicate you have a hung thread, and it may be worth attaching
> with jconsole to see what that thread is blocking on, before so much
> time has passed that you have thousands of threads waiting.
>
> If it *is* a matter of parts of files being unreadable it may be
> easier to spot using standard I/O rather than mmap():ed I/O since you
> should clearly see it blocking on a read in that case.
>
Interesting, I will try this. Thanks Peter!
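
A rough sketch of how the "fluctuation floor" check could be automated,
assuming the active ROW-READ-STAGE count is sampled periodically by some
external means and collected into a list. The function name, the window size,
and the sample values are hypothetical; nothing here talks to Cassandra.

```python
def first_stuck_sample(active_counts, window=5):
    """Find where the active count stops ever dropping back to zero.

    Returns the index of the first sample that starts a run of `window`
    consecutive samples all >= 1, or None if no such run exists.
    `window` is an arbitrary choice; a longer window gives fewer false alarms.
    """
    for end in range(window, len(active_counts) + 1):
        start = end - window
        if min(active_counts[start:end]) >= 1:
            return start
    return None


# Made-up samples: the count fluctuates between 0 and 3, then between 1 and 4.
samples = [0, 2, 1, 0, 3, 0, 1, 2, 1, 4, 2, 1, 3]
print(first_stuck_sample(samples))
```

When this returns an index rather than None, that would be the moment to
attach jconsole and look at what the read-stage threads are blocked on.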



> --
> / Peter Schuller
>
