Hi Jean,

"I had to reboot a node. I killed the cassandra process on that node."

You should drain the node before killing the java process (or before using any service stop command). This is not what caused your issue, but it helps you keep consistency if you use counters, and it makes the restart faster in any case.
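Something like this, as a rough sketch (the stop/start commands are assumptions, adapt them to however Cassandra is managed on your machines):

    nodetool drain                   # flush memtables and stop accepting writes on this node
    sudo service cassandra stop      # or kill the java process once the drain has finished
    sudo reboot
    sudo service cassandra start     # after the machine is back up
    nodetool status                  # check the node comes back UN (Up / Normal)

Because of the drain there is almost nothing to replay from the commitlog at startup, which is why the restart is faster.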
What is going on highly depends on what you did before. Did you change the RF? Did you change the topology? Are you sure this node had data before you restarted it? What does "nodetool status mykeyspace" output?

To make the join faster you could try to bootstrap the node again (see the command sketch after the quoted message below). I just hope you have RF > 1. By the way, you currently have one replica down; if you still want reads and writes to work, make sure your consistency level is low enough.

"It’s like the whole cluster is paralysed" --> what does that mean exactly? Please be more specific. Tell us what actions were taken before this occurred and what exactly is not working now, because a C* cluster in this state could perfectly well keep running. No SPOF.

C*heers

2015-06-23 16:10 GMT+02:00 Jean Tremblay <jean.tremb...@zen-innovations.com>:

> Does anyone know what to do when such an event occurs?
> Does anyone know how this could happen?
>
> I would have tried repairing the node with nodetool repair but that takes
> much too long… I need my cluster to work for our online system. At this
> point nothing is working. It’s like the whole cluster is paralysed. The
> only solution I see is to remove that node temporarily.
>
> Thanks for your comments.
>
> On 23 Jun 2015, at 12:45, Jean Tremblay <jean.tremb...@zen-innovations.com> wrote:
>
> Hi,
>
> I have a cluster with 5 nodes running Cassandra 2.1.6.
>
> I had to reboot a node. I killed the cassandra process on that node,
> rebooted the machine, and restarted Cassandra.
>
> ~/apache-cassandra-DATA/data:321>nodetool status
> Datacenter: datacenter1
> =======================
> Status=Up/Down
> |/ State=Normal/Leaving/Joining/Moving
> --  Address        Load       Tokens  Owns  Host ID                               Rack
> UN  192.168.2.104  158.27 GB  256     ?     6479205e-6a19-49a8-b1a1-7e788ec29caa  rack1
> UN  192.168.2.100  4.75 GB    256     ?     e821da50-23c6-4ea0-b3a1-275ded63bc1f  rack1
> UN  192.168.2.101  157.43 GB  256     ?     01525665-bacc-4207-a8c3-eb4fd9532401  rack1
> UN  192.168.2.102  159.27 GB  256     ?     596a33d7-5089-4c7e-a9ad-e1f22111b160  rack1
> UN  192.168.2.103  167 GB     256     ?     0ce1d48e-57a9-4615-8e12-d7ef3d621c7d  rack1
>
> After restarting node 192.168.2.100 I noticed that its load was diminished
> to 2%. And now when I query the cluster my queries are bombing and that
> node times out with an error:
>
> WARN  [MessagingService-Incoming-/192.168.2.102] 2015-06-23 12:26:00,056 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
>         at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
>         at org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:322) ~[apache-cassandra-2.1.6.jar:2.1.6]
>         at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:302) ~[apache-cassandra-2.1.6.jar:2.1.6]
>         at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-2.1.6.jar:2.1.6]
>
> What is going on? Has anyone seen something like that?
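PS: by "bootstrap the node again" I mean roughly the following, run on 192.168.2.100 once you are sure the remaining replicas hold its data. This is only a sketch: the data directory paths and the service commands are assumptions (your output suggests a non-default data directory), so double-check the replace_address procedure in the 2.1 documentation for your exact version before running it.

    nodetool drain
    sudo service cassandra stop
    sudo rm -rf /var/lib/cassandra/data/* \
                /var/lib/cassandra/commitlog/* \
                /var/lib/cassandra/saved_caches/*
    # add this to JVM_OPTS in cassandra-env.sh so the node replaces itself
    # and streams its ranges back from the other replicas:
    #   -Dcassandra.replace_address=192.168.2.100
    sudo service cassandra start
    nodetool netstats                  # follow the streaming progress
    nodetool status mykeyspace         # with a keyspace name, "Owns" shows real percentages

Remove the replace_address option from cassandra-env.sh once the node has rejoined. Regarding the consistency level, from cqlsh you can lower it for the session with "CONSISTENCY ONE;" while one replica is down; do the equivalent in your driver for the application.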