We are using XFS for the data volume. We are load testing now, and compaction is way behind, but weekly manual compaction should help catch things up.
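For what it's worth, the weekly run can be driven by something like the sketch below (rough and illustrative only; the host list and keyspace/column family names are placeholders, and nodetool's exact arguments differ between Cassandra versions, so check yours):

    #!/usr/bin/env python
    # Rough sketch of a weekly major-compaction driver (e.g. run from cron).
    # Host names and keyspace/column family names are placeholders; nodetool's
    # exact arguments vary between Cassandra versions.
    import subprocess
    import sys

    NODES = ["cass01.example.com", "cass02.example.com"]  # placeholder hosts
    KEYSPACES = {"Keyspace1": ["CF1", "CF2"]}             # placeholder schema

    def compact(host, keyspace, cf):
        cmd = ["nodetool", "-h", host, "compact", keyspace, cf]
        print("running: " + " ".join(cmd))
        return subprocess.call(cmd)

    if __name__ == "__main__":
        failures = 0
        for host in NODES:
            for ks, cfs in KEYSPACES.items():
                for cf in cfs:
                    if compact(host, ks, cf) != 0:
                        failures += 1
        sys.exit(1 if failures else 0)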
Smaller nodes just seem to fit the Cassandra architecture a lot better. We cannot use cloud instances, so the cost for us to go to <500 GB nodes is prohibitive. Cassandra lumps all processes on the node together into one bucket, and that all but requires a smaller per-node data set. There are no regions, tablets, or partitions created to throttle compaction and prevent huge data files.

On Sat, Dec 18, 2010 at 5:24 AM, Peter Schuller <peter.schul...@infidyne.com> wrote:
> > How many nodes? 10 - 16 cores each (2 x quad HT CPUs)
> > How much RAM per node? 24 GB
> > What disks and how many? SATA 7200rpm, 1x1 TB for commit log, 4x1 TB (RAID 0) for data
> > Is your ring balanced? Yes, random partitioned very evenly
> > How many column families? 4 CFs x 3 keyspaces
> > How much RAM is dedicated to Cassandra? 12 GB heap (probably too high?)
> > What type of caching are you using? Key caching
> > What are the sizes of caches? 500k-1m values for 2 of the CFs
> > What is the hit rate of caches? High, 90%+
> > What does your disk utilization|CPU|memory look like at peak times? Disk goes to 90%+ under heavy read load. CPU load high as well. Latency does not change that much for single reads vs. under load (30 threads). We can keep current read latency up to 25-30 read threads if no writes or compaction are going on. We are worried about what we see in terms of latency for a single read.
> > What are your average mean|max row sizes from cfstats? 30k avg / 5 MB max for one CF and 311k avg / 855k max for the other.
> > On average, for a given sstable, how large are the data, bloom, and index files? 30 GB data, 189k filter, 5.7 MB index for one CF; 98 GB data, 587k filter, 18 MB index for the other.
>
> I do want to make one point very clear: regardless of whether or not you're running a perfectly configured Cassandra, or any other database, whenever your total data set is beyond RAM you *will* take I/O load as a function of the locality of data access. There is just no way around that. If 10% of your read requests go to data that has not recently been read (the long tail), it won't be cached, and no storage system will ever magically make that read not go down to disk.
>
> There are definitely things to tweak. For example, making sure that the memory you do have for caching purposes is used as efficiently as possible to minimize the number of reads that go down to disk. But in order to figure out what is going on, you are going to have to gain an understanding of what the data access pattern is like. For example, if the hot set is very small and slowly changing, you may be able to have 100 TB per node and take the traffic without any difficulties. On the other hand, if your data is accessed completely randomly, you may have trouble with a data set just twice the size of RAM (and most requests end up going down to disk).
>
> In the worst-case access patterns where caching is completely ineffective (such as random access to a data set several times the size of RAM), it will be solely about disk IOPS, and SSDs are probably the way to go unless lots and lots of servers become cheaper (depending on data sizes mostly).
>
> Now, that said, the practical reality with Cassandra is not the theoretical optimum w.r.t. caching. Some things to consider:
>
> (1) Whatever memory you give the JVM is going to be a direct trade-off in the sense that said memory is no longer available for the operating system page cache. If you "waste" memory there for no good reason, you're just losing caching potential.

I have considered dropping the heap down to 8 GB, but having suffered through many concurrent mode failures (CMFs) in the past, I thought the larger heap would help prevent stop-the-world GC.
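Back-of-the-envelope on that heap trade-off (purely illustrative; the hot-set size and the cache/disk latencies below are assumptions, not measurements from our cluster):

    # Toy model of effective read latency vs. JVM heap size on a 24 GB node.
    # Assumed numbers (not measurements): ~1 ms for a read served from the OS
    # page cache, ~10 ms for a read that seeks to disk, ~2 GB reserved for the
    # OS and other processes, and a 40 GB "hot set" of frequently read data.
    RAM_GB = 24.0
    HOT_SET_GB = 40.0
    CACHED_LAT_MS = 1.0
    DISK_LAT_MS = 10.0

    def effective_latency_ms(heap_gb):
        page_cache_gb = max(RAM_GB - heap_gb - 2.0, 0.0)
        hit_rate = min(page_cache_gb / HOT_SET_GB, 1.0)  # crude: hit rate ~ cache/hot-set ratio
        return hit_rate * CACHED_LAT_MS + (1.0 - hit_rate) * DISK_LAT_MS

    for heap in (8, 12):
        print("heap %2d GB -> ~%.1f ms average per read" % (heap, effective_latency_ms(heap)))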
If you "waste" memory there for no good reason, > you're just loosing caching potential. > I have considered dropping the heap down to 8gb, but having pained through many cmf in the past I thought the larger heap should help prevent the stop the world gc. > > (2) That said, the row cache in Cassandra is not affected by > compaction/repair. Anything cached by the operating system page cache > (anything not in row cache) will be affected, in terms of cache > warmness, by compactions and repair operations. This means you have to > expect a variation in cache locality whenever compaction/repair runs, > in addition to the I/O load caused directly by same. (There is > on-going work to improve this, but nothing usable right now.) > Row cache is not an option for us. We expect going to disk, and key cache is the only cache that can help speed things up a little. We have wide rows so key cache is an un-expensive boost. > > (3) Reads that do go down to disk will go down to all sstables that > contain data for that *row* (the bloom filters are at the row key > level). If you have reads to rows that span multiple sstables (because > they are continually written for example) this means that sstable > counts is very important. If you only write rows once and never touch > them again, it is much much less of an issue. > > This is why we schedule weekly major compaction. We update ALL rows every day, often over-writing previous values. > (4) There is a limit to bloom filter size that means that if you go > above 143 M keys in a single sstable (I believe that is the number > based on IRC conversations, I haven't checked) the bloom filter false > positive rate will increase. Very significantly if you are much much > bigger than 143 M keys. This leads to unnecessary reads from sstables > that do not contain data for the row being requested. Given that you > say you have lots of data, consider this. But note that only the row > *count* matters, not the size (in terms of megabytes) of the sstables. > Remember that on major compactions, all data in a column family is > going to end up in a single sstable, so do consider the total row > count (per node) here, not just what the sstable layout happens to be > be now. (There is on-going work to remove this limitation w.r.t. row > counts.) > > Not an issue for us with wide rows. > (5) In general the way I/O works, latency will skyrocket once you > start saturating your disks. As long as you're significantly below > full utilization of your disks, you'll see pretty stable and low > latencies. As you approach full saturation, the latencies will tend to > increase super-linearly. Once you're *above* saturation, your > latencies skyrocket and reads are dropped because the rate cannot be > sustained. This means that while latency is a great indicator to look > at to judge what the current user perceived behavior is, it is *not* a > good thing to look at to extrapolate resource demands or figure out > how far you are from saturation / need for more hardware. > > This we can see with munin. We throttle the read load to avoid that "wall". > -- > / Peter Schuller >