Hi Cassandra community,

we are currently experimenting with different Cassandra scaling strategies. We observed that Cassandra performance decreases drastically when we insert more data into the cluster (say, going from 60GB to 600GB in a 3-node cluster), so we want to find out how to deal with this problem. One scaling strategy looks interesting, but we don't fully understand yet what is going on. The strategy works like this: add new nodes to a Cassandra cluster with "auto_bootstrap = false" to avoid streaming data to the new nodes. We were a bit surprised that this strategy improved performance considerably and that it worked much better than the other strategies we tried before, both in terms of scaling speed and performance impact during scaling.
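For concreteness, this is roughly how we bring up each additional node (a sketch; paths and start commands of course depend on the installation):

    # on each new node, before the first start
    $ echo "auto_bootstrap: false" >> conf/cassandra.yaml    # join the ring without streaming existing data
    $ bin/cassandra                                          # then start the node as usual

With this setting, the new node takes responsibility for its token range right away, but it starts out without any of the existing SSTables for that range.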
Let me share our little experiment with you.

Setup S1: We have 4 nodes, each similar to the Amazon EC2 large instance type, i.e., 4 cores, 15GB memory, 700GB free disk space; the Cassandra replication factor is 2. Each node is loaded with 10 million 1KB rows in a single column family, i.e., ~20 GB of data per node, using the Yahoo Cloud Serving Benchmark (YCSB) tool. All Cassandra settings are left at their defaults. The workload is a 95/5 read/update mix with a Zipfian request distribution (= YCSB workload B). In setup S1 we achieved an average throughput of ~800 ops/s.

Setup S2: We then added two empty nodes to our 4-node cluster with auto_bootstrap set to false. The throughput we observed thereafter tripled from 800 ops/s to 2,400 ops/s. We looked at various nodetool outputs to understand this effect. On the new nodes, "$ nodetool info" tells us that the key cache is empty, while "$ nodetool cfstats" clearly shows read and write requests coming in. The memtable column count and data size are several times larger than on the other four nodes.

We are wondering: what exactly gets stored on the two new nodes in setup S2, and where (cache, memtable, disk)? Would it be necessary (in a production environment) to stream the old SSTables from the other four nodes at some point in time, or can we simply be happy with the performance improvement and leave it like this? Are we missing something here? Can you point us to specific monitoring data that would help us better understand the observed effect?

Thanks,
Markus Klems
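P.S. For reference, these are the nodetool commands we have been comparing across the old and the new nodes (a rough sketch; <host>, <keyspace> and <cf> are placeholders for our hosts and the YCSB column family):

    $ nodetool -h <host> ring                          # token ownership after adding the two nodes
    $ nodetool -h <host> info                          # load, heap usage, key cache
    $ nodetool -h <host> cfstats                       # read/write counts, memtable columns/data size, SSTable count
    $ nodetool -h <host> cfhistograms <keyspace> <cf>  # read latency and SSTables touched per read
    $ nodetool -h <host> tpstats                       # pending/blocked thread pool stages

If there are other metrics (JMX or otherwise) that would make the effect more visible, we would be happy to collect them.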