I'm trying to gain some insight into what happens with a cluster when
indexes are being built, or when CFs with indexed columns are being written
to.

Over the past couple of days we've been doing some loads into a CF with 29
indexed columns.  Eventually, the nodes just got overwhelmed and the client
(Hector) started getting timeouts.  We were using a MapReduce job to
load an HDFS file into Cassandra, though we had limited the load job to one
task per node.  My confusion comes from how difficult it was to know that
the nodes were becoming overwhelmed.  The ring consistently reported that
all nodes were up and it did not appear that there were pending operations
under tpstats.  I also monitor this cluster with Ganglia, and at no point
did any of the machine loads appear very high at all, yet our job kept
failing with Hector reporting timeouts.
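
For reference, the checks I was running were roughly along these lines
(the host name is just a placeholder, and compactionstats may or may not
be available depending on your version):

    # which nodes the ring currently considers up or down
    nodetool -h cass-node1 ring

    # per-stage thread pool stats; the pending/blocked counts here are
    # what I was watching for signs of the nodes falling behind
    nodetool -h cass-node1 tpstats

    # any compaction or index-build activity currently in progress
    nodetool -h cass-node1 compactionstats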

Today we decided to leave index creation until the end, and just load the
data using the same Hector code.  We bumped up the Hadoop concurrency to two
concurrent tasks per node and everything went fine, as expected; we've done
much larger loads than this using Hadoop, and as long as you don't shoot for
too much concurrency, Cassandra can deal with it.  So now we have the data
in the column family and I updated the column family metadata in the CLI to
enable the 29 indexes.  As soon as I did that, the ring started reporting
nodes as down intermittently, and HintedHandoff tasks started accumulating
under tpstats.  Ganglia is reporting very low overall load, so I'm wondering
why it's taking so long for CLI and nodetool commands to return.
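
For what it's worth, the metadata update was done in cassandra-cli with a
statement roughly like the one below (column family and column names are
made up here; the real statement lists all 29 indexed columns in the
column_metadata array):

    update column family MyData with column_metadata =
      [{column_name: col1, validation_class: UTF8Type, index_type: KEYS},
       {column_name: col2, validation_class: UTF8Type, index_type: KEYS}];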

I'm just trying to get a better handle on what kind of actions have a
serious impact on cluster availability and to know the right places to look
to try to get ahead of those conditions.

Thanks for any insight you can provide,
Matt
