You have hit exactly on my main concern. Initially we were consistently seeing < 10ms read latency; now we see 25ms (30GB sstable file), 50ms (100GB sstable file), and 65ms (330GB sstable file) for a single read with nothing else going on in the cluster. Concurrency is not our problem/concern (at this point); our problem is slow reads in total isolation. Frankly, the concern is that a 2TB node with a 1TB sstable (worst case scenario) will result in > 100ms read latency in total isolation.
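For concreteness, a minimal sketch of a client-side harness for taking that kind of isolated single-read measurement could look like the following. The read itself is a placeholder for whatever client library is in use, and the sample count, pacing, and output format are arbitrary choices, not anything Cassandra-specific:

    import java.util.Arrays;
    import java.util.concurrent.Callable;

    // Minimal sketch: time a number of isolated single-row reads and print
    // percentiles. The read is a placeholder to be filled in with whatever
    // client is in use (raw Thrift, Hector, ...). Pacing the requests keeps
    // them from queuing behind each other, so each sample reflects a read
    // "in total isolation".
    public class SingleReadTimer {
        public static void main(String[] args) throws Exception {
            final int samples = 100;        // arbitrary sample count
            final long pauseMillis = 500;   // arbitrary pacing between reads
            long[] micros = new long[samples];

            Callable<Object> oneRead = new Callable<Object>() {
                public Object call() throws Exception {
                    // TODO: issue exactly one QUORUM read for a known key here.
                    return null;
                }
            };

            for (int i = 0; i < samples; i++) {
                long start = System.nanoTime();
                oneRead.call();
                micros[i] = (System.nanoTime() - start) / 1000L;
                Thread.sleep(pauseMillis);
            }

            Arrays.sort(micros);
            System.out.printf("p50=%dus p95=%dus max=%dus%n",
                    micros[samples / 2],
                    micros[(int) (samples * 0.95)],
                    micros[samples - 1]);
        }
    }

Reporting percentiles over paced samples, rather than a single number, helps separate genuinely disk-bound reads from occasional outliers such as GC pauses or a cold page cache.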
I got a lot of help from Jonathan Ellis to get very good heap settings, some of which are now standard in 0.6.8. Concurrent mode failure (CMF) problems have yet to rear their heads again. He also helped fix some major inefficiencies with quorum reads (we read/write with quorum). I do think that we could optimize our caching for sure, but our assumption for the worst case scenario is disk (non-SSD) based reads.

Thanks.

On Sat, Dec 18, 2010 at 12:58 PM, Peter Schuller <peter.schul...@infidyne.com> wrote:

> > Smaller nodes just seem to fit the Cassandra architecture a lot better. We can not use cloud instances, so the cost for us to go to <500GB nodes is prohibitive. Cassandra lumps all processes on the node together into one bucket, and that almost requires a smaller node data set. There are no regions, tablets, or partitions created to throttle compaction and prevent huge data files.
>
> There are definitely some things to improve. I think what you have mentioned is covered, but if you feel you're hitting something which is not covered by the wiki page I mentioned in my previous post (http://wiki.apache.org/cassandra/LargeDataSetConsiderations), please do augment or say so.
>
> In your original post you said you went from 5 ms to 50 ms. Is this an average latency under load, or the latency of a single request absent other traffic and absent background compaction, etc.?
>
> If a single read is taking 50 ms for reasons that have nothing to do with other concurrent activity, that smells of something being wrong to me.
>
> Otherwise, is your primary concern worse latency/throughput during compactions/repairs, or just the overall throughput/latency during normal operation?
>
> > I have considered dropping the heap down to 8GB, but having pained through many CMFs in the past I thought the larger heap should help prevent the stop-the-world GC.
>
> I'm not sure what got merged to 0.6.8, but you may well want to grab the JVM options from the 0.7 branch; in particular, the initiating occupancy settings that trigger CMS mark-sweep phases. Concurrent mode failures could just be because the CMS heuristics failed, rather than due to the heap legitimately being too small. If the heuristics are failing, maybe you do have the ability to lower the heap size if you change the CMS trigger. I recommend monitoring heap usage for that; look at the heap usage as it appears right after a CMS collection has completed to judge the "real" live set size. [A small JMX sketch of this follows at the end of this message.]
>
> > Row cache is not an option for us. We expect to go to disk, and key cache is the only cache that can help speed things up a little. We have wide rows, so key cache is an inexpensive boost.
>
> Ok, makes sense.
>
> > This is why we schedule weekly major compaction. We update ALL rows every day, often overwriting previous values.
>
> Ok - so you're definitely in a position to suffer more than most use cases from data being spread over multiple sstables.
>
> >> (5) In general the way I/O works, latency will skyrocket once you start saturating your disks. As long as you're significantly below full utilization of your disks, you'll see pretty stable and low latencies. As you approach full saturation, the latencies will tend to increase super-linearly. Once you're *above* saturation, your latencies skyrocket and reads are dropped because the rate cannot be sustained.
> >> This means that while latency is a great indicator to look at to judge what the current user-perceived behavior is, it is *not* a good thing to look at to extrapolate resource demands or figure out how far you are from saturation / need for more hardware.
>
> > This we can see with munin. We throttle the read load to avoid that "wall".
>
> Do you have a sense of how many reads on disk you're taking per read request to the node? Do you have a sense of the size of the active set? A big question is going to be whether caching is effective at all, and how much additional caching would help.
>
> In any case, it would be interesting to know whether you are seeing more disk seeks per read than you "should".
>
> --
> / Peter Schuller
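Regarding the suggestion above to watch heap usage right after a CMS collection completes: a minimal, hedged sketch of pulling that number over JMX could look like the following. The JMX port, the collector MBean name, and the pool names are assumptions about a typical Sun-JVM/CMS setup rather than anything Cassandra-specific, so adjust them to match the node's configuration:

    import java.lang.management.MemoryUsage;
    import java.util.Map;
    import javax.management.JMX;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;
    import com.sun.management.GarbageCollectorMXBean;

    // Sketch: connect to a node over JMX and print heap usage as it looked
    // right after the last CMS collection, to approximate the "real" live set
    // size. Host/port and the MBean name are assumptions (Sun JVM with CMS
    // enabled; 0.6-era Cassandra exposes JMX on port 8080 by default).
    public class LiveSetAfterCms {
        public static void main(String[] args) throws Exception {
            String host = args.length > 0 ? args[0] : "localhost";
            JMXServiceURL url = new JMXServiceURL(
                    "service:jmx:rmi:///jndi/rmi://" + host + ":8080/jmxrmi");
            JMXConnector jmxc = JMXConnectorFactory.connect(url);
            try {
                MBeanServerConnection mbs = jmxc.getMBeanServerConnection();

                // Name under which the CMS old-generation collector registers itself.
                ObjectName name = new ObjectName(
                        "java.lang:type=GarbageCollector,name=ConcurrentMarkSweep");
                GarbageCollectorMXBean cms =
                        JMX.newMXBeanProxy(mbs, name, GarbageCollectorMXBean.class);

                if (cms.getLastGcInfo() == null) {
                    System.out.println("No CMS collection has completed yet.");
                    return;
                }
                // Per-pool usage immediately after the last CMS cycle; the old
                // generation (usually "CMS Old Gen") is the number to watch.
                for (Map.Entry<String, MemoryUsage> e :
                         cms.getLastGcInfo().getMemoryUsageAfterGc().entrySet()) {
                    System.out.println(e.getKey() + ": "
                            + e.getValue().getUsed() / (1024 * 1024) + " MB used");
                }
            } finally {
                jmxc.close();
            }
        }
    }

Sampling that value periodically and graphing it next to the configured heap size gives a reasonable picture of the live set relative to the CMS initiating-occupancy trigger.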