I'm following up on this issue, which I've been monitoring for the last several weeks. I thought people might find my observations interesting.
Ever since increasing the heap size to 64GB, we've had no OOM conditions that resulted in a JVM termination. Our nodes have around 2.5TB of data each, and the replication factor is four. IO on the cluster seems to be fine, though I haven't been paying particular attention to any GC hangs. The bottleneck now seems to be the repair time. If any node becomes too inconsistent, or needs to be replaced, the rebuild time is over a week. That issue alone makes this cluster configuration unsuitable for production use.

	- .Dustin

On Jul 30, 2012, at 2:04 PM, Dustin Wenz <dustinw...@ebureau.com> wrote:

> Thanks for the pointer! It sounds likely that's what I'm seeing. CFStats
> reports that the bloom filter size is currently several gigabytes. Is there
> any way to estimate how much heap space a repair would require? Is it a
> function of simply adding up the filter file sizes, plus some fraction of
> neighboring nodes?
>
> I'm still curious about the largest heap sizes that people are running with
> on their deployments. I'm considering increasing ours to 64GB (with 96GB
> physical memory) to see where that gets us. Would it be necessary to keep the
> young-gen size small to avoid long GC pauses? I also suspect that I may need
> to keep my memtable sizes small to avoid long flushes; maybe in the 1-2GB
> range.
>
> - .Dustin
>
> On Jul 29, 2012, at 10:45 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> Yikes. You should read:
>>
>> http://wiki.apache.org/cassandra/LargeDataSetConsiderations
>>
>> Essentially what it sounds like you are now running into is this:
>>
>> The BloomFilters for each SSTable must exist in main memory. Repair
>> tends to create some extra data which normally gets compacted away
>> later.
>>
>> Your best bet is to temporarily raise the Xmx heap and adjust the
>> index sampling size.
>> That is, if you need to save the data; if it is just test
>> data you may want to give up and start fresh.
>>
>> Generally the issue with large disk configurations is that it is hard to
>> keep a good RAM/disk ratio. Then most reads turn into disk seeks and
>> the throughput is low. I get the vibe people believe large stripes are
>> going to help Cassandra. The issue is that stripes generally only
>> increase sequential throughput, but Cassandra is a random read system.
>>
>> How much RAM/disk you need is case dependent but a 1/5 ratio of RAM to
>> disk is where I think most people want to be, unless their system is
>> carrying SSD disks.
>>
>> Again you have to keep your bloom filters in Java heap memory so any
>> design that tries to create a quadrillion small rows is going to have
>> memory issues as well.
>>
>> On Sun, Jul 29, 2012 at 10:40 PM, Dustin Wenz <dustinw...@ebureau.com> wrote:
>>> I'm trying to determine if there are any practical limits on the amount of
>>> data that a single node can handle efficiently, and if so, whether I've hit
>>> that limit or not.
>>>
>>> We've just set up a new 7-node cluster with Cassandra 1.1.2 running under
>>> OpenJDK6. Each node is a 12-core Xeon with 24GB of RAM and is attached to a
>>> stripe of 10 3TB disk mirrors (a total of 20 spindles each), connected
>>> via dual SATA-3 interconnects. I can read and write around 900MB/s
>>> sequentially on the arrays. I started out with Cassandra tuned with
>>> all-default values, with the exception of the compaction throughput which
>>> was increased from 16MB/s to 100MB/s. These defaults will set the heap size
>>> to 6GB.
>>>
>>> Our schema is pretty simple; only 4 column families and each has one
>>> secondary index. The replication factor was set to four, and compression
>>> disabled. Our access patterns are intended to be about equal numbers of
>>> inserts and selects, with no updates, and the occasional delete.
>>>
>>> The first thing we did was begin to load data into the cluster.
>>> We could perform about 3000 inserts per second, which stayed mostly flat.
>>> Things started to go wrong around the time the nodes exceeded 800GB.
>>> Cassandra began to generate a lot of "mutations messages dropped"
>>> warnings, and was complaining that the heap was over 75% capacity.
>>>
>>> At that point, we stopped all activity on the cluster and attempted a
>>> repair. We did this so we could be sure that the data was fully consistent
>>> before continuing. Our mistake was probably trying to repair all of the
>>> nodes simultaneously - within an hour, Java terminated on one of the nodes
>>> with a heap out-of-memory message. I then increased all of the heap sizes
>>> to 8GB, and reduced the heap_newsize to 800MB. All of the nodes were
>>> restarted, and there was no outside activity on the cluster. I then
>>> began a repair on a single node. Within a few hours, it OOMed again and
>>> exited. I then increased the heap to 12GB, and attempted the same thing.
>>> This time, the repair ran for about 7 hours before exiting from an OOM
>>> condition.
>>>
>>> By now, the repair had increased the amount of data on some of the nodes to
>>> over 1.2TB. There is no going back to a 6GB heap size - Cassandra now exits
>>> with an OOM during startup unless the heap is set higher. It's at 16GB now,
>>> and a single node has been repairing for a couple of days. Though I have no
>>> personal experience with this, I've been told that Java's garbage collector
>>> doesn't perform well with heaps above 8GB. I'm wary of setting it higher,
>>> but I can add up to 192GB of RAM to each node if necessary.
>>>
>>> How much heap does Cassandra need for this amount of data with only four
>>> CFs? Am I scaling this cluster in completely the wrong direction? Is there
>>> a magic garbage collection setting that I need to add in cassandra-env that
>>> isn't there by default?
>>>
>>> Thanks,
>>>
>>> - .Dustin
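
For a rough sense of scale, two of the numbers in this thread can be sanity-checked with a back-of-the-envelope sketch. The first function uses the textbook formula for an optimally sized Bloom filter; Cassandra's actual on-heap footprint may well differ, so treat it only as an order-of-magnitude guide. The key count and false-positive rate below are hypothetical placeholders, not measurements from this cluster, and the function names are made up for illustration. The second function takes the 2.5TB per node and the one-week rebuild from the follow-up above; since the rebuild takes "over a week", the result is an upper bound on the average rate.

```python
import math


def bloom_filter_bytes(num_keys: int, fp_rate: float) -> float:
    """Textbook size of an optimally sized Bloom filter, in bytes.

    Uses m = -n * ln(p) / (ln 2)^2 bits for n keys at false-positive rate p.
    """
    bits = -num_keys * math.log(fp_rate) / (math.log(2) ** 2)
    return bits / 8


def implied_throughput_mb_s(data_bytes: float, seconds: float) -> float:
    """Average transfer rate in MB/s (decimal megabytes)."""
    return data_bytes / seconds / 1e6


if __name__ == "__main__":
    # ASSUMPTION: hypothetical row-key count and false-positive rate.
    keys = 2_000_000_000
    fp_rate = 0.01
    print(f"bloom filter: {bloom_filter_bytes(keys, fp_rate) / 1e9:.2f} GB")

    # From the post: about 2.5TB per node, rebuilt in (at least) one week.
    one_week = 7 * 24 * 3600
    print(f"rebuild rate: {implied_throughput_mb_s(2.5e12, one_week):.2f} MB/s")
```

With those inputs the rebuild works out to roughly 4MB/s on average, far below the 900MB/s sequential throughput of the arrays, which is consistent with the disks not being the limiting factor for repair.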