We are running 0.6.8 and are approaching 1 TB/node in a 10-node cluster with RF=3. Our read times seem to be getting worse as we load data into the cluster, and I am worried there is a scaling problem with large column families. All benchmarks/times come from cfstats reporting, so no client code or client-side timings are referenced.

Our initial tests always hovered around <=5 ms read latency. We then went through a lot of work to load large amounts of data into the system, and now our read latency is ~20-25 ms. We can find no reason for this; we have checked under load, with no load, with manually compacted CFs, etc., and the numbers are all consistent with what we saw before, just 3-4x slower. We then compared against another, larger CF that we have loaded, and reads from it are taking ~50-60 ms. What scares us is that the data file for the slower CF is 3x the size of the smaller one, with 3x the read latency: the 20 ms CF's data file is ~30 GB and the bigger one is ~100 GB. We were expecting < 5 ms read latency (with key caching), based on what we saw when we had far less data in the cluster, but are worried it will only get worse as the tables get bigger.

I keep thinking we are missing something obvious and this is just a coincidence, as it does not make sense. We also upgraded from 0.6.6 between tests, so we will try to see if 0.6.6 is faster, but that is the only real change that has occurred in our cluster.
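In case it helps to see what kind of read we mean, below is a rough sketch of a single timed slice read over raw Thrift that we could use to cross-check the cfstats numbers client-side. It is only a sketch against the 0.6 Thrift interface; the keyspace, CF, and row key names are placeholders rather than our real schema, and the generated classes may differ slightly between 0.6.x releases.

import java.util.List;

import org.apache.cassandra.thrift.Cassandra;
import org.apache.cassandra.thrift.ColumnOrSuperColumn;
import org.apache.cassandra.thrift.ColumnParent;
import org.apache.cassandra.thrift.ConsistencyLevel;
import org.apache.cassandra.thrift.SlicePredicate;
import org.apache.cassandra.thrift.SliceRange;
import org.apache.thrift.protocol.TBinaryProtocol;
import org.apache.thrift.transport.TSocket;
import org.apache.thrift.transport.TTransport;

public class SliceLatencyCheck {
    public static void main(String[] args) throws Exception {
        // Plain Thrift connection; 0.6 listens on 9160 with an unframed transport by default.
        TTransport transport = new TSocket("127.0.0.1", 9160);
        Cassandra.Client client = new Cassandra.Client(new TBinaryProtocol(transport));
        transport.open();

        // Small slice: up to 100 columns from a single row.
        // "MyKeyspace", "MyCF" and "some-row-key" are placeholders, not our real schema.
        ColumnParent parent = new ColumnParent("MyCF");
        SlicePredicate predicate = new SlicePredicate();
        predicate.setSlice_range(new SliceRange(new byte[0], new byte[0], false, 100));

        long start = System.nanoTime();
        List<ColumnOrSuperColumn> slice = client.get_slice(
                "MyKeyspace", "some-row-key", parent, predicate, ConsistencyLevel.ONE);
        long elapsedMs = (System.nanoTime() - start) / 1000000L;

        System.out.println("read " + slice.size() + " columns in " + elapsedMs + " ms");
        transport.close();
    }
}

(One caveat we are aware of: a timing like this covers the whole round trip, including the network and the coordinator, whereas, as we understand it, the cfstats read latency is the local per-CF number on each node, which is why we have been quoting cfstats.)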
I have read that read latency goes up with total data size, but how much degradation should we expect? What is a "normal" read latency range, if there is such a thing, for a small slice of super columns/columns? Can we really put 2 TB of data on a node and still get good read latency when querying a handful of CFs? Any experience or explanations would be greatly appreciated. Thanks in advance for any help!