> Yep - I've been looking at these - I don't see anything in iostat/dstat etc
> that point strongly to a problem. There is quite a bit of I/O load, but it
> looks roughly uniform on slow and fast instances of the queries. The last
> compaction ran 4 days ago - which was before I started seeing variable
> performance
[snip]

> I know why it is slow - it's clearly I/O bound. I am trying to hunt down why
> it is sometimes much faster even though I have (tried) to replicate the
> same conditions

What does "clearly I/O bound" mean, and what is "quite a bit" of I/O load?

In general, if queries arrive at a rate that is determined by outside
sources (rather than by the time the last query took to execute), you
will typically get either more queries than your cluster can take, or
fewer. If fewer, there is a non-trivially sized grey area where the
overall I/O throughput needed is lower than what is available, but the
closer you are to capacity, the more often requests have to wait for
other I/O to complete, for purely statistical reasons. If you're running
close to maximum capacity, high variation in query latency is to be
expected.

That said, if you're seeing consistently bad latencies for a while and,
at other times, consistently good latencies, that sounds like something
different, but it would hopefully be observable somehow.

-- 
/ Peter Schuller (@scode, http://worldmodscode.wordpress.com)
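
PS: To make the "purely statistical reasons" point concrete, here is a
rough sketch rather than a model of any real cluster. It assumes a single
FIFO server, exponentially distributed service times and arrival gaps,
and arrivals that do not slow down when the server is busy. The names
simulate_latencies and percentile are made up for this illustration. The
only thing it is meant to show is that median and tail latency both grow,
and spread apart, as utilization approaches 1.

```python
import random


def simulate_latencies(utilization, n_requests=20000, service_mean=1.0, seed=1):
    """Latencies of a single FIFO server under open-loop (external) arrivals.

    Assumption: exponential service times and inter-arrival gaps. A real
    disk subsystem with caches and several spindles will behave differently.
    """
    rng = random.Random(seed)
    arrival_rate = utilization / service_mean
    wait = 0.0  # time the current request spends queued behind earlier work
    latencies = []
    for _ in range(n_requests):
        service = rng.expovariate(1.0 / service_mean)
        latencies.append(wait + service)
        gap = rng.expovariate(arrival_rate)  # time until the next arrival
        # Lindley recursion: the next request waits for any leftover work.
        wait = max(0.0, wait + service - gap)
    return latencies


def percentile(values, p):
    """Nearest-rank style percentile; p is a fraction between 0 and 1."""
    ordered = sorted(values)
    return ordered[min(len(ordered) - 1, int(p * len(ordered)))]


for u in (0.5, 0.8, 0.95):
    lat = simulate_latencies(u)
    print(f"utilization={u:.2f} "
          f"median={percentile(lat, 0.5):.2f} "
          f"p99={percentile(lat, 0.99):.2f}")
```

Comparing the printed rows for different utilizations gives a feel for
how wide the "grey area" below full capacity can be.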