After I put my Cassandra cluster under heavy load (1k/s writes + 1k/s reads) for one day, I accumulated about 30GB of data in sstables. I think the caches have warmed up to their stable state.

When I started the test, I manually cat'ed all the sstables to /dev/null so that they would be loaded into memory (the box has 32GB of RAM, so there was a lot of extra space). At that time, "sar -B" showed about 100 page-in requests per second.
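The warm-up step was essentially the following (the data directory shown is the stock /var/lib/cassandra/data; substitute your own layout):

    # read every sstable file once so the kernel pulls it into the page cache
    find /var/lib/cassandra/data -type f -exec cat {} + > /dev/null

    # then watch paging activity in one-second samples
    sar -B 1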
After a day of load, however, I consistently see around 2000 page-in requests per second, and the end-to-end response latency seen by the application is also higher. So I was curious how often my Cassandra server resorts to reading the sstables, and I looked at the JMX attributes on the ColumnFamilyStore MBean: BloomFilterFalseRatio is 1.0, BloomFilterFalsePositives is 2810, and ReadCount is about 1 million (these are numbers taken after a restart, so they are smaller). That gives about 0.28% of reads going to disk.

I am wondering what ballpark number you see with your production clusters. Is 0.28% a good number? Besides the bloom filters, what other approaches do we have to avoid disk reads? As the data grows, we apparently can't fit all of it in memory, so do we just increase the machine count until the per-box data volume fits in memory again?

Thanks
Yang
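P.S. For anyone reproducing this: the same counters show up in "nodetool cfstats" output, so a JMX browser isn't strictly needed (the host flag below is just my setup), and the 0.28% figure is plain division:

    # per-column-family stats, including read count and bloom filter counters
    nodetool -h localhost cfstats

    # bloom filter false positives / total reads
    echo 'scale=4; 2810 / 1000000' | bc    # -> .0028, i.e. about 0.28%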