No, I did not.
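If I understand the suggestion, it would have been something along these lines once the node showed UN again (a sketch only; "mykeyspace" stands in for my real keyspace name):

    # repair only this node's primary token ranges, per keyspace
    nodetool repair -pr mykeyspace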
On 24 Jun 2015, at 06:05, Jason Wee <peich...@gmail.com> wrote:

On the node 192.168.2.100, did you run repair after its status was UN?

On Wed, Jun 24, 2015 at 2:46 AM, Jean Tremblay <jean.tremb...@zen-innovations.com> wrote:

Dear Alain,

Thank you for your reply.

Ok, yes, I did not drain. The cluster was loaded with tons of records, and no new records had been added for a few weeks. Each node had a load of about 160 GB as seen in "nodetool status". I killed the Cassandra daemon and restarted it. After Cassandra was restarted, I could see a load of 5 GB in "nodetool status"!!

I don't use counters. I use RF 3 on 5 nodes. I did not change the replication factor. I have two types of read queries: one uses QUORUM and the other uses ONE as consistency level. I did not change the topology.

"Are you sure this node had data before you restarted it?"

Actually, the full story is:

- I stopped node0 (192.168.2.100), and I restarted it.
- I stopped node1 (192.168.2.101).
- I ran nodetool status and noticed that node0 was UN with a load of 5 GB. I found this really weird because all the other nodes had about 160 GB. I also saw that node1 was DN with a load of about 160 GB.
- I restarted node1.
- I ran nodetool status and noticed that node1 was UN with a load of also 5 GB; it previously had a load of about 160 GB. Of that I am sure.
- Then my program could no longer query C*. Neither the QUORUM nor the ONE consistency level statements could read data.

"What does nodetool status mykeyspace output?"

I cannot try this anymore. I flushed the whole cluster, and I am currently reloading everything. I was in too much of a hurry. I have a demo tomorrow, and I will manage to have it back before then. After my bad decision to flush the cluster, I realised that I could simply have bootstrapped my two nodes again. Learning by doing.

"'It's like the whole cluster is paralysed' --> what does that mean? Please be more precise on this. You should tell us the actions that were taken before this occurred and what is not working now, since a C* cluster in this state could perfectly well keep running. No SPOF."

What did I do before? Well, this cluster was basically idling. I was only running lots of selects on it. It had been loaded for a few weeks.
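For next time, as I understand Alain's advice below, the restart of node0 should have gone roughly like this (a sketch only; I run Cassandra by hand from a tarball, so the start command is mine):

    nodetool drain     # flush memtables and stop accepting writes
    # kill the Cassandra JVM only once drain has completed, then reboot
    bin/cassandra      # restart the daemon
    nodetool status    # check the node comes back UN with its ~160 GB load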
But what I noticed when I restarted node0 is the following:

INFO  [InternalResponseStage:1] 2015-06-23 11:45:32,723 ColumnFamilyStore.java:882 - Enqueuing flush of schema_columnfamilies: 131587 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:32,723 Memtable.java:346 - Writing Memtable-schema_columnfamilies@917967643(34850 serialized bytes, 585 ops, 0%/0% of on/off-heap limit)
WARN  [GossipTasks:1] 2015-06-23 11:45:33,459 FailureDetector.java:251 - Not marking nodes down due to local pause of 25509152054 > 5000000000
INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:33,982 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-11-Data.db (5274 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=144120)
INFO  [GossipStage:1] 2015-06-23 11:45:33,985 StorageService.java:1642 - Node /192.168.2.101 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:33,991 Gossiper.java:987 - Node /192.168.2.102 has restarted, now UP
INFO  [SharedPool-Worker-1] 2015-06-23 11:45:33,992 Gossiper.java:954 - InetAddress /192.168.2.102 is now UP
INFO  [HANDSHAKE-/192.168.2.102] 2015-06-23 11:45:33,993 OutboundTcpConnection.java:485 - Handshaking version with /192.168.2.102
INFO  [GossipStage:1] 2015-06-23 11:45:33,993 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:33,999 Gossiper.java:987 - Node /192.168.2.103 has restarted, now UP
INFO  [SharedPool-Worker-1] 2015-06-23 11:45:33,999 Gossiper.java:954 - InetAddress /192.168.2.103 is now UP
INFO  [GossipStage:1] 2015-06-23 11:45:34,001 StorageService.java:1642 - Node /192.168.2.103 state jump to normal
INFO  [HANDSHAKE-/192.168.2.103] 2015-06-23 11:45:34,020 OutboundTcpConnection.java:485 - Handshaking version with /192.168.2.103
INFO  [main] 2015-06-23 11:45:34,021 StorageService.java:1642 - Node zennode0/192.168.2.100 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:34,028 StorageService.java:1642 - Node /192.168.2.104 state jump to normal
INFO  [main] 2015-06-23 11:45:34,038 CassandraDaemon.java:583 - Waiting for gossip to settle before accepting client requests...
INFO  [GossipStage:1] 2015-06-23 11:45:34,039 StorageService.java:1642 - Node /192.168.2.101 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:34,047 StorageService.java:1642 - Node /192.168.2.103 state jump to normal
INFO  [GossipStage:1] 2015-06-23 11:45:34,055 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
INFO  [CompactionExecutor:1] 2015-06-23 11:45:34,062 CompactionTask.java:270 - Compacted 1 sstables to [/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-2,]. 171,593 bytes to 171,593 (~100% of original) in 1,544ms = 0.105987MB/s. 2 total partitions merged to 2.
Partition merge counts were {1:2, }
INFO  [CompactionExecutor:4] 2015-06-23 11:45:34,063 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-2-Data.db')]
INFO  [GossipStage:1] 2015-06-23 11:45:34,066 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:34,140 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-12-Data.db (8246 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=144120)
INFO  [CompactionExecutor:2] 2015-06-23 11:45:34,140 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-11-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-10-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-9-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-12-Data.db')]
INFO  [InternalResponseStage:1] 2015-06-23 11:45:34,140 ColumnFamilyStore.java:882 - Enqueuing flush of schema_columns: 284282 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:34,141 Memtable.java:346 - Writing Memtable-schema_columns@813508218(55024 serialized bytes, 1041 ops, 0%/0% of on/off-heap limit)
INFO  [CompactionExecutor:4] 2015-06-23 11:45:37,980 CompactionTask.java:270 - Compacted 1 sstables to [/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-3,]. 171,593 bytes to 171,593 (~100% of original) in 3,916ms = 0.041789MB/s. 2 total partitions merged to 2.
Partition merge counts were {1:2, }
INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:39,265 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-12-Data.db (12966 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=251695)
INFO  [CompactionExecutor:1] 2015-06-23 11:45:39,266 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-10-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-11-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-9-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-12-Data.db')]
INFO  [InternalResponseStage:1] 2015-06-23 11:45:39,271 ColumnFamilyStore.java:882 - Enqueuing flush of schema_triggers: 154 (0%) on-heap, 0 (0%) off-heap
INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:39,271 Memtable.java:346 - Writing Memtable-schema_triggers@652166139(8 serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
INFO  [CompactionExecutor:2] 2015-06-23 11:45:39,275 CompactionTask.java:270 - Compacted 4 sstables to [/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-13,]. 17,971 bytes to 8,246 (~45% of original) in 5,134ms = 0.001532MB/s. 8 total partitions merged to 3. Partition merge counts were {2:2, 4:1, }
INFO  [main] 2015-06-23 11:45:42,042 CassandraDaemon.java:615 - No gossip backlog; proceeding

I don't know what this warning means:

WARN  [GossipTasks:1] 2015-06-23 11:45:33,459 FailureDetector.java:251 - Not marking nodes down due to local pause of 25509152054 > 5000000000
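Taken literally, the two large numbers look like nanoseconds: the node apparently saw itself paused for about 25.5 seconds, well over the 5-second threshold, so the failure detector declined to mark any peers down while it caught up. A long stop-the-world GC is the usual suspect for such a pause. A rough way to check (the log path is a guess for a tarball install like mine):

    # look for long GC pauses reported around 11:45:33
    grep GCInspector logs/system.log | tail -20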
And shortly after the "No gossip backlog; proceeding" message:

INFO  [HANDSHAKE-zennode0/192.168.2.100] 2015-06-23 11:46:00,566 OutboundTcpConnection.java:485 - Handshaking version with zennode0/192.168.2.100
WARN  [MessagingService-Incoming-/192.168.2.103] 2015-06-23 11:46:00,595 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:89) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.Row$RowSerializer.deserialize(Row.java:74) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:69) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:188) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:170) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.6.jar:2.1.6]

The same UnknownColumnFamilyException and identical stack trace were then logged from /192.168.2.102 (11:46:00,629) and again from /192.168.2.103 (11:46:00,633).
That seems nastier to me. I don't know what it means, but I have the impression that the nodes I did not touch are now talking to node0 about a table which node0 no longer knows about!!! A table node0 "forgot".

Well, I am not asking for a miracle here. If I had noticed that there was already a problem after restarting the first node, I would simply have fixed that node before doing anything else… but I did not notice it.

Thanks for your help...

On 23 Jun 2015, at 19:24, Alain RODRIGUEZ <arodr...@gmail.com> wrote:

Hi Jean,

"I had to reboot a node. I killed the cassandra process on that node." You should drain the node before killing the Java process (or using any service stop command). This is not what caused the issue, but it will help you keep consistency if you use counters, and it makes the reboot faster in any case.

What is going on highly depends on what you did before. Did you change the RF? Did you change the topology? Are you sure this node had data before you restarted it? What does "nodetool status mykeyspace" output?

To make the join faster, you could try to bootstrap the node again. I just hope you have an RF > 1 (btw, you have one replica down; if you want reads / writes to keep working, take care to use a consistency level that is low enough).

"'It's like the whole cluster is paralysed' --> what does that mean? Please be more precise on this. You should tell us the actions that were taken before this occurred and what is not working now, since a C* cluster in this state could perfectly well keep running. No SPOF."

C*heers

2015-06-23 16:10 GMT+02:00 Jean Tremblay <jean.tremb...@zen-innovations.com>:

Does anyone know what to do when such an event occurs? Does anyone know how this could happen?

I would have tried repairing the node with nodetool repair, but that takes much too long… I need my cluster to work for our online system. At this point nothing is working. It's like the whole cluster is paralysed. The only solution I see is to temporarily remove that node.

Thanks for your comments.

On 23 Jun 2015, at 12:45, Jean Tremblay <jean.tremb...@zen-innovations.com> wrote:

Hi,

I have a cluster with 5 nodes running Cassandra 2.1.6. I had to reboot a node. I killed the Cassandra process on that node, rebooted the machine, and restarted Cassandra.

~/apache-cassandra-DATA/data:321>nodetool status
Datacenter: datacenter1
=======================
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns  Host ID                               Rack
UN  192.168.2.104  158.27 GB  256     ?     6479205e-6a19-49a8-b1a1-7e788ec29caa  rack1
UN  192.168.2.100  4.75 GB    256     ?     e821da50-23c6-4ea0-b3a1-275ded63bc1f  rack1
UN  192.168.2.101  157.43 GB  256     ?     01525665-bacc-4207-a8c3-eb4fd9532401  rack1
UN  192.168.2.102  159.27 GB  256     ?     596a33d7-5089-4c7e-a9ad-e1f22111b160  rack1
UN  192.168.2.103  167 GB     256     ?     0ce1d48e-57a9-4615-8e12-d7ef3d621c7d  rack1

After restarting node 192.168.2.100, I noticed that its load had diminished to 2% of what it was. And now when I query the cluster, my queries are bombing, and that node times out with an error:

WARN  [MessagingService-Incoming-/192.168.2.102] 2015-06-23 12:26:00,056 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
        at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:322) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:302) ~[apache-cassandra-2.1.6.jar:2.1.6]
        at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-2.1.6.jar:2.1.6]

What is going on? Has anyone experienced something like that?
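For what it's worth, the cfId in that exception is the table's schema UUID, so a "Couldn't find cfId" error like this usually points at schema disagreement between nodes. A rough sketch of two checks on a 2.1 cluster (standard nodetool commands, but treat this as a starting point, not a fix):

    # do all nodes report the same schema version?
    nodetool describecluster
    # if the restarted node is the odd one out, drop its local schema
    # and re-fetch it from the rest of the cluster:
    nodetool resetlocalschema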