I'm trying to gain some insight into what happens with a cluster when indexes are being built, or when CFs with indexed columns are being written to.
Over the past couple of days we've been doing some loads into a CF with 29 indexed columns. Eventually the nodes just got overwhelmed and the client (Hector) started getting timeouts. We were using a MapReduce job to load an HDFS file into Cassandra, though we had limited the load job to one task per node.

My confusion comes from how difficult it was to tell that the nodes were becoming overwhelmed. The ring consistently reported that all nodes were up, and there did not appear to be any pending operations under tpstats. I also monitor this cluster with Ganglia, and at no point did any of the machine loads appear very high, yet our job kept failing with Hector reporting timeouts.

Today we decided to leave index creation until the end and just load the data using the same Hector code. We bumped the Hadoop concurrency up to two concurrent tasks per node, and everything went fine. That was what we expected; we've done much larger loads than this using Hadoop, and as long as you don't shoot for too much concurrency, Cassandra can deal with it.

So now we have the data in the column family, and I updated the column family metadata in the CLI to enable the 29 indexes. As soon as I did that, the ring started reporting nodes as down intermittently, and HintedHandoff operations started accumulating under tpstats. Ganglia is reporting very low overall load, so I'm also wondering why CLI and nodetool commands are taking so long to return.

I'm just trying to get a better handle on what kinds of actions have a serious impact on cluster availability, and to know the right places to look so I can get ahead of those conditions.

Thanks for any insight you can provide,
Matt
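
P.S. In case it helps, the indexes were enabled with an "update column family" statement in cassandra-cli, roughly like the sketch below. The column family and column names here are placeholders rather than our actual schema, and the real statement listed all 29 columns:

    update column family MyCF with column_metadata =
    [
      {column_name: col1, validation_class: UTF8Type, index_type: KEYS},
      {column_name: col2, validation_class: UTF8Type, index_type: KEYS}
    ];

After running that, we watch the cluster with "nodetool ring" and "nodetool tpstats", which is where the intermittent down reports and the HintedHandoff backlog show up.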