I am preparing to migrate a large amount of data to Cassandra. To test my migration code, I’ve been doing some dry runs against a test cluster. The test cluster runs Cassandra 2.0.15 on 3 nodes, with RF=1 and CL=QUORUM. I know RF=1 with CL=QUORUM is a weird combination, but the production cluster that will eventually receive this data is RF=3. I am running with RF=1 so it’s faster while I work out the kinks in the migration.
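For reference, here is a minimal sketch of the quorum arithmetic behind that combination, assuming the usual rule that a quorum is floor(RF / 2) + 1 replicas. The function name `quorum_size` is just something I made up for illustration; it is not part of any driver API.

```python
def quorum_size(replication_factor: int) -> int:
    """Number of replicas that must acknowledge a QUORUM request.

    Assumes the standard rule: floor(RF / 2) + 1.
    """
    return replication_factor // 2 + 1


# Compare the test cluster (RF=1) with the production cluster (RF=3).
for rf in (1, 3):
    q = quorum_size(rf)
    print(f"RF={rf}: quorum={q}, replicas that may miss the write={rf - q}")
```

So with RF=1, QUORUM needs the single replica to acknowledge every write, while with RF=3 one replica can miss a write and the request still succeeds.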
A few things have puzzled me after writing several tens of millions of records to my test cluster.
My main concern is that I have a few tens of thousands of dropped mutation messages. Presumably I’m overloading my cluster, yet I never see more than about 10% CPU utilization, and even my I/O wait is negligible. The curious part is that the driver hasn’t thrown any exceptions, even though mutations have been dropped. I’ve seen dropped mutation messages on my production cluster too, and there, as here, I’ve never gotten errors back from the client. I had always assumed that one node dropped the mutation while the other two did not, so quorum was still satisfied. With RF=1 there is only one replica per write, so I don’t understand how mutations can be dropped without the client telling me about it. Does this mean my cluster is missing data and I have no idea?
Each node has a couple dozen all-time blocked FlushWriters. Is that bad?
I have around 100 dropped counter mutations, which is very weird because I don’t write any counters. I have counters in my schema for tracking view counts, but the migration code doesn’t write to them. How could I get dropped counter mutation messages when I never modify counters?
Any insights would be appreciated. Thanks in advance.
Robert