> the servers spending >50% of the time in io-wait

Note that I/O wait is not necessarily a good indicator, depending on
the situation. In particular, if you have multiple drives, I/O wait
can mostly be ignored. Similarly, if you have non-trivial CPU usage in
addition to disk I/O, it is also not a good indicator. I/O wait
essentially tells you how much time the CPUs spend doing nothing
because the only processes that would otherwise be runnable are
waiting on disk I/O. So even a single process waiting on disk I/O will
produce lots of I/O wait, even if you have 24 drives.
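If you want to watch the raw number yourself, it comes from the iowait
field of /proc/stat on Linux (the same counter top/vmstat report). A
rough sketch, just to show where it comes from; the 1-second interval
is arbitrary:

    import time

    def cpu_times():
        # Aggregate "cpu" line: user nice system idle iowait irq softirq ...
        with open("/proc/stat") as f:
            fields = f.readline().split()
        return [int(x) for x in fields[1:]]

    def iowait_percent(interval=1.0):
        before = cpu_times()
        time.sleep(interval)
        after = cpu_times()
        deltas = [a - b for a, b in zip(after, before)]
        total = sum(deltas)
        return 100.0 * deltas[4] / total if total else 0.0  # index 4 = iowait

    if __name__ == "__main__":
        print("iowait: %.1f%%" % iowait_percent())

Note that this only tells you CPUs were idle with I/O outstanding; it
says nothing about which disks, or how many, were actually busy.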

The per-disk % utilization is generally a much better indicator
(assuming no hardware RAID device, and assuming no SSD), along with
the average queue size.
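Both are what iostat -x reports as %util and avgqu-sz. To make that
concrete, here is a rough sketch (Linux only; "sda" is just a
placeholder device) that approximates them by sampling /proc/diskstats:

    import time

    def read_stats(dev):
        with open("/proc/diskstats") as f:
            for line in f:
                parts = line.split()
                if parts[2] == dev:
                    io_ms = int(parts[12])        # ms spent doing I/Os
                    weighted_ms = int(parts[13])  # weighted ms doing I/Os
                    return io_ms, weighted_ms
        raise ValueError("device not found: %s" % dev)

    def sample(dev, interval=1.0):
        io0, w0 = read_stats(dev)
        time.sleep(interval)
        io1, w1 = read_stats(dev)
        interval_ms = interval * 1000.0
        util = 100.0 * (io1 - io0) / interval_ms   # ~ iostat %util
        avg_queue = (w1 - w0) / interval_ms        # ~ iostat avgqu-sz
        return util, avg_queue

    if __name__ == "__main__":
        util, qsize = sample("sda")
        print("util=%.1f%%  avgqu-sz=%.2f" % (util, qsize))

In practice you would just run iostat -x 1 rather than roll your own,
but the above shows what the numbers actually mean.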

>> In general, if you have queries that come in at some rate that
>> is determined by outside sources (rather than by the time the last
>> query took to execute),
>
> That's an interesting approach - is that likely to give close to optimal
> performance ?

I just mean that it all depends on the situation. If, for example, you
have some number N of clients doing work as fast as they can,
bottlenecking only on Cassandra, you're essentially saturating the
Cassandra cluster no matter what (until the client/network becomes the
bottleneck). Under such conditions (saturation) you should generally
never expect good latencies.

For most non-batch production use cases, incoming requests tend to be
driven by something external, such as user behavior or automated
systems unrelated to the Cassandra cluster. In those cases, you have a
certain amount of incoming requests at any given time that you must
serve within a reasonable time frame, and that's where the question of
how much I/O you're doing relative to the maximum comes in. For good
latencies, you always want to be significantly below the maximum,
particularly when platter-based disk I/O is involved.
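As a toy illustration of why (using a textbook M/M/1 queueing model,
which is not a model of Cassandra internals, and an assumed 8 ms
average service time per disk I/O), mean response time grows roughly
as 1/(1 - utilization), so it blows up as you approach the maximum:

    service_time_ms = 8.0  # assumed average time per platter I/O

    for utilization in (0.25, 0.5, 0.75, 0.9, 0.95, 0.99):
        mean_response_ms = service_time_ms / (1.0 - utilization)
        print("util=%4.0f%%  mean response ~ %.0f ms"
              % (utilization * 100, mean_response_ms))

The exact numbers don't matter; the shape of the curve is the point.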

> That may well explain it - I'll have to think about what that means for our
> use case as load will be extremely bursty

To be clear though, even your typical un-bursty load is still bursty
once you look at it at sufficient resolution, unless you have
something specifically ensuring that it is entirely smooth. A
completely random distribution over time, for example, would look very
even on almost any graph you can imagine unless you have sub-second
resolution, but it would still exhibit unevenness and have an effect
on latency.
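A quick way to convince yourself, modeling "completely random" as
Poisson arrivals at a steady 100 req/s (all numbers here made up):

    import random

    rate_per_sec = 100.0
    seconds = 60
    bucket_ms = 100

    # Generate arrival timestamps via exponential inter-arrival times.
    arrivals = []
    t = 0.0
    while t < seconds:
        t += random.expovariate(rate_per_sec)
        arrivals.append(t)

    buckets = [0] * int(seconds * 1000 / bucket_ms)
    for a in arrivals:
        idx = int(a * 1000 / bucket_ms)
        if idx < len(buckets):
            buckets[idx] += 1

    print("total over the minute: %d (looks flat on a graph)" % len(arrivals))
    print("per-100ms min/max: %d / %d (not so flat up close)"
          % (min(buckets), max(buckets)))

Per-minute the rate looks constant, but the 100 ms buckets still vary
quite a bit, and those short bursts are what the disks and latencies see.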

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
