Sorry, I wasn't clear on the timeline of events. I started the index build and then posted this message to the list. Once I read the links you posted, I did expect the cluster to crash, but I let it run until it blew up anyway, since I didn't really know how to stop the index build.
Which is sort of where I'm still stuck: I don't want to corrupt that column family by issuing an "update column family" that has a smaller set of indexes while the index build is going on, at least not without some encouragement from the list that doing that won't wreck the column family. Is there a safe way to tell an index build to stop after the cluster starts up from a crash due to the index build?

Thanks,
Matt

On Thu, Mar 10, 2011 at 1:40 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> If you read the bugs I linked, you would see that this is expected
> behavior with 0.7.3 once you get more data than you can index
> in-memory.
>
> You should wait for the next Hudson build (which will include 2295)
> and use that. Or, create your indexes before adding the data.
>
> On Thu, Mar 10, 2011 at 12:26 PM, Matt Kennedy <stinkym...@gmail.com> wrote:
> > Well it looks like the index creation job crashed the cluster. All of the
> > nodes were down having dumped out .hprof files. I brought the cluster back
> > up and when I do "describe keyspace ks" it looks like the index build
> > process has started over again. Is it safe to attempt to stop that by
> > running an "update column family" command with fewer indexes defined? Or is
> > there a better way to safely terminate this index creation process that I
> > assume will crash the cluster again eventually?
> >
> > Would creating the indexes one at a time help? Or will the same problem
> > occur once I get to a certain number of indexes on the column family?
> >
> > Thanks,
> > Matt
> >
> > On Wed, Mar 9, 2011 at 8:40 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> >>
> >> https://issues.apache.org/jira/browse/CASSANDRA-2294
> >> https://issues.apache.org/jira/browse/CASSANDRA-2295
> >>
> >> On Wed, Mar 9, 2011 at 5:47 PM, Matt Kennedy <stinkym...@gmail.com> wrote:
> >> > I'm trying to gain some insight into what happens with a cluster when
> >> > indexes are being built, or when CFs with indexed columns are being
> >> > written to.
> >> >
> >> > Over the past couple of days we've been doing some loads into a CF with
> >> > 29 indexed columns. Eventually, the nodes just got overwhelmed and the
> >> > client (Hector) started getting timeouts. We were using a MapReduce job
> >> > to load an HDFS file into Cassandra, though we had limited the load job
> >> > to one task per node. My confusion comes from how difficult it was to
> >> > know that the nodes were becoming overwhelmed. The ring consistently
> >> > reported that all nodes were up and it did not appear that there were
> >> > pending operations under tpstats. I also monitor this cluster with
> >> > Ganglia, and at no point did any of the machine loads appear very high
> >> > at all, yet our job kept failing with Hector reporting timeouts.
> >> >
> >> > Today we decided to leave index creation until the end, and just load
> >> > the data using the same Hector code. We bumped up the Hadoop concurrency
> >> > to two concurrent tasks per node, and everything went fine, as expected;
> >> > we've done much larger loads than this using Hadoop, and as long as you
> >> > don't shoot for too much concurrency, Cassandra can deal with it. So now
> >> > we have the data in the column family and I updated the column family
> >> > metadata in the CLI to enable the 29 indexes.
> >> > As soon as I do that, the ring starts reporting that nodes are down
> >> > intermittently, and HintedHandoffs are starting to accumulate under
> >> > tpstats. Ganglia is reporting very low overall load, so I'm wondering
> >> > why it's taking so long for cli and nodetool commands to return.
> >> >
> >> > I'm just trying to get a better handle on what kind of actions have a
> >> > serious impact on cluster availability and to know the right places to
> >> > look to try to get ahead of those conditions.
> >> >
> >> > Thanks for any insight you can provide,
> >> > Matt
> >>
> >> --
> >> Jonathan Ellis
> >> Project Chair, Apache Cassandra
> >> co-founder of DataStax, the source for professional Cassandra support
> >> http://www.datastax.com
>
> --
> Jonathan Ellis
> Project Chair, Apache Cassandra
> co-founder of DataStax, the source for professional Cassandra support
> http://www.datastax.com