On the node 192.168.2.100, did you run a repair after its status came back to UN?

On Wed, Jun 24, 2015 at 2:46 AM, Jean Tremblay <jean.tremb...@zen-innovations.com> wrote:
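As a sanity check before running that repair, the status column can be read out of captured `nodetool status` output. A minimal sketch; the `status_of` helper is illustrative (not a real nodetool feature), and the sample line mirrors the output shown later in this thread:

```shell
# Print the status column (e.g. UN, DN) for a given node address from
# nodetool-status-style output. Only repair the node once it reports UN.
status_of() {
  # $1 = captured `nodetool status` output, $2 = node address
  printf '%s\n' "$1" | awk -v ip="$2" '$2 == ip { print $1 }'
}

sample='UN 192.168.2.100 4.75 GB 256 ? e821da50-23c6-4ea0-b3a1-275ded63bc1f rack1'
status_of "$sample" 192.168.2.100   # prints: UN
```

Once the node shows UN, `nodetool repair` can be run against it for the affected keyspaces.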
> Dear Alain,
>
> Thank you for your reply.
>
> OK, yes, I did not drain. The cluster was loaded with tons of records, and no new records had been added for a few weeks. Each node had a load of about 160 GB as seen in "nodetool status". I killed the Cassandra daemon and restarted it. After Cassandra was restarted, I could see in "nodetool status" a load of 5 GB!!
>
> I don't use counters.
> I use RF 3 on 5 nodes. I did not change the replication factor.
> I have two types of read queries. One uses QUORUM and the other uses ONE as its consistency level.
> I did not change the topology.
>
> Are you sure this node had data before you restarted it?
>
> Actually, the full story is:
>
> - I stopped node0 (192.168.2.100), and I restarted it.
> - I stopped node1 (192.168.2.101).
> - I ran nodetool status and noticed that node0 was UN with a load of 5 GB. I found this really weird because all the other nodes had about 160 GB. I also saw that node1 was DN with a load of about 160 GB.
> - I restarted node1.
> - I ran nodetool status and noticed that node1 was also UN with a load of only 5 GB; it previously had a load of about 160 GB. Of that I'm sure.
> - Then my program could no longer query C*. Neither the QUORUM nor the ONE consistency level statements could read data.
>
> What does a "nodetool status mykeyspace" output?
>
> I cannot try this anymore. I flushed the whole cluster, and I am currently reloading everything. I was too much in a hurry. I have a demo tomorrow, and I will manage to have it back before then.
>
> After my bad decision to flush the cluster, I realised that I could have bootstrapped my two nodes again. Learning by doing.
>
> "It's like the whole cluster is paralysed" --> what does that mean? Please be more specific.
>
> You should tell us what actions were taken before this occurred and what is not working now, since a C* cluster in this state could run perfectly well. No SPOF.
>
> What I did before?
> Well, this cluster was basically idling. I was only making lots of selects on it. It had been loaded for a few weeks.
> But what I noticed when I restarted node0 is the following:
>
> INFO  [InternalResponseStage:1] 2015-06-23 11:45:32,723 ColumnFamilyStore.java:882 - Enqueuing flush of schema_columnfamilies: 131587 (0%) on-heap, 0 (0%) off-heap
> INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:32,723 Memtable.java:346 - Writing Memtable-schema_columnfamilies@917967643(34850 serialized bytes, 585 ops, 0%/0% of on/off-heap limit)
> >>>> WARN  [GossipTasks:1] 2015-06-23 11:45:33,459 FailureDetector.java:251 - Not marking nodes down due to local pause of 25509152054 > 5000000000
> INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:33,982 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/local-7ad54392bcdd35a684174e047860b377/system-local-ka-11-Data.db (5274 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=144120)
> INFO  [GossipStage:1] 2015-06-23 11:45:33,985 StorageService.java:1642 - Node /192.168.2.101 state jump to normal
> INFO  [GossipStage:1] 2015-06-23 11:45:33,991 Gossiper.java:987 - Node /192.168.2.102 has restarted, now UP
> INFO  [SharedPool-Worker-1] 2015-06-23 11:45:33,992 Gossiper.java:954 - InetAddress /192.168.2.102 is now UP
> INFO  [HANDSHAKE-/192.168.2.102] 2015-06-23 11:45:33,993 OutboundTcpConnection.java:485 - Handshaking version with /192.168.2.102
> INFO  [GossipStage:1] 2015-06-23 11:45:33,993 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
> INFO  [GossipStage:1] 2015-06-23 11:45:33,999 Gossiper.java:987 - Node /192.168.2.103 has restarted, now UP
> INFO  [SharedPool-Worker-1] 2015-06-23 11:45:33,999 Gossiper.java:954 - InetAddress /192.168.2.103 is now UP
> INFO  [GossipStage:1] 2015-06-23 11:45:34,001 StorageService.java:1642 - Node /192.168.2.103 state jump to normal
> INFO  [HANDSHAKE-/192.168.2.103] 2015-06-23 11:45:34,020 OutboundTcpConnection.java:485 - Handshaking version with /192.168.2.103
> INFO  [main] 2015-06-23 11:45:34,021 StorageService.java:1642 - Node zennode0/192.168.2.100 state jump to normal
> INFO  [GossipStage:1] 2015-06-23 11:45:34,028 StorageService.java:1642 - Node /192.168.2.104 state jump to normal
> INFO  [main] 2015-06-23 11:45:34,038 CassandraDaemon.java:583 - Waiting for gossip to settle before accepting client requests...
> INFO  [GossipStage:1] 2015-06-23 11:45:34,039 StorageService.java:1642 - Node /192.168.2.101 state jump to normal
> INFO  [GossipStage:1] 2015-06-23 11:45:34,047 StorageService.java:1642 - Node /192.168.2.103 state jump to normal
> INFO  [GossipStage:1] 2015-06-23 11:45:34,055 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
> INFO  [CompactionExecutor:1] 2015-06-23 11:45:34,062 CompactionTask.java:270 - Compacted 1 sstables to [/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-2,]. 171,593 bytes to 171,593 (~100% of original) in 1,544ms = 0.105987MB/s. 2 total partitions merged to 2.
> Partition merge counts were {1:2, }
> INFO  [CompactionExecutor:4] 2015-06-23 11:45:34,063 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-2-Data.db')]
> INFO  [GossipStage:1] 2015-06-23 11:45:34,066 StorageService.java:1642 - Node /192.168.2.102 state jump to normal
> INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:34,140 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-12-Data.db (8246 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=144120)
> INFO  [CompactionExecutor:2] 2015-06-23 11:45:34,140 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-11-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-10-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-9-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-12-Data.db')]
> INFO  [InternalResponseStage:1] 2015-06-23 11:45:34,140 ColumnFamilyStore.java:882 - Enqueuing flush of schema_columns: 284282 (0%) on-heap, 0 (0%) off-heap
> INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:34,141 Memtable.java:346 - Writing Memtable-schema_columns@813508218(55024 serialized bytes, 1041 ops, 0%/0% of on/off-heap limit)
> INFO  [CompactionExecutor:4] 2015-06-23 11:45:37,980 CompactionTask.java:270 - Compacted 1 sstables to [/home/maia/apache-cassandra-DATA/data/system/hints-2666e20573ef38b390fefecf96e8f0c7/system-hints-ka-3,]. 171,593 bytes to 171,593 (~100% of original) in 3,916ms = 0.041789MB/s. 2 total partitions merged to 2. Partition merge counts were {1:2, }
> INFO  [MemtableFlushWriter:1] 2015-06-23 11:45:39,265 Memtable.java:385 - Completed flushing /home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-12-Data.db (12966 bytes) for commitlog position ReplayPosition(segmentId=1435052707645, position=251695)
> INFO  [CompactionExecutor:1] 2015-06-23 11:45:39,266 CompactionTask.java:140 - Compacting [SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-10-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-11-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-9-Data.db'), SSTableReader(path='/home/maia/apache-cassandra-DATA/data/system/schema_columns-296e9c049bec3085827dc17d3df2122a/system-schema_columns-ka-12-Data.db')]
> INFO  [InternalResponseStage:1] 2015-06-23 11:45:39,271 ColumnFamilyStore.java:882 - Enqueuing flush of schema_triggers: 154 (0%) on-heap, 0 (0%) off-heap
> INFO  [MemtableFlushWriter:2] 2015-06-23 11:45:39,271 Memtable.java:346 - Writing Memtable-schema_triggers@652166139(8 serialized bytes, 1 ops, 0%/0% of on/off-heap limit)
> INFO  [CompactionExecutor:2] 2015-06-23 11:45:39,275 CompactionTask.java:270 - Compacted 4 sstables to [/home/maia/apache-cassandra-DATA/data/system/schema_columnfamilies-45f5b36024bc3f83a3631034ea4fa697/system-schema_columnfamilies-ka-13,]. 17,971 bytes to 8,246 (~45% of original) in 5,134ms = 0.001532MB/s. 8 total partitions merged to 3.
> Partition merge counts were {2:2, 4:1, }
> INFO  [main] 2015-06-23 11:45:42,042 CassandraDaemon.java:615 - No gossip backlog; proceeding
>
> I don't know what this warning means:
> WARN  [GossipTasks:1] 2015-06-23 11:45:33,459 FailureDetector.java:251 - Not marking nodes down due to local pause of 25509152054 > 5000000000
>
> And shortly after the "proceeding"...
>
> INFO  [HANDSHAKE-zennode0/192.168.2.100] 2015-06-23 11:46:00,566 OutboundTcpConnection.java:485 - Handshaking version with zennode0/192.168.2.100
> WARN  [MessagingService-Incoming-/192.168.2.103] 2015-06-23 11:46:00,595 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
>     at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:89) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.db.Row$RowSerializer.deserialize(Row.java:74) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.db.ReadResponseSerializer.deserialize(ReadResponse.java:69) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.net.MessageIn.read(MessageIn.java:99) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:188) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:170) ~[apache-cassandra-2.1.6.jar:2.1.6]
>     at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:88) ~[apache-cassandra-2.1.6.jar:2.1.6]
> WARN  [MessagingService-Incoming-/192.168.2.102] 2015-06-23 11:46:00,629 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9 (same stack trace as above)
> WARN  [MessagingService-Incoming-/192.168.2.103] 2015-06-23 11:46:00,633 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9 (same stack trace as above)
>
> That seems nastier to me. I don't know what it means, but I have the impression that the nodes which I did not touch are now talking to node0 about a table which node0 does not know about!!! A table node0 "forgot".
>
> Well, I am not asking for a miracle here. If I had noticed that there was already a problem after restarting the first node, I would simply have fixed that node before doing anything else... but I did not notice it.
>
> Thanks for your help...
>
> On 23 Jun 2015, at 19:24, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
> Hi Jean,
>
> "I had to reboot a node. I killed the cassandra process on that node". You should drain the node before killing the Java process (or using any service stop command). This is not what caused the issue, yet it will help you keep consistency if you use counters, and it makes the reboot faster in any case.
>
> What is going on depends heavily on what you did before.
>
> Did you change the RF?
> Did you change the Topology?
> Are you sure this node had data before you restarted it?
> What does a "nodetool status mykeyspace" output?
>
> To make the join faster you could try to bootstrap the node again. I just hope you have RF > 1 (by the way, you have one replica down; if you still want Reads / Writes to work, take care to use a Consistency Level low enough).
>
> "It's like the whole cluster is paralysed" --> what does that mean? Please be more specific about this.
> You should tell us what actions were taken before this occurred and what is not working now, since a C* cluster in this state could run perfectly well. No SPOF.
>
> C*heers
>
> 2015-06-23 16:10 GMT+02:00 Jean Tremblay <jean.tremb...@zen-innovations.com>:
>
>> Does anyone know what to do when such an event occurs?
>> Does anyone know how this could happen?
>>
>> I would have tried repairing the node with nodetool repair, but that takes much too long... I need my cluster to work for our online system. At this point nothing is working. It's like the whole cluster is paralysed. The only solution I see is to remove that node temporarily.
>>
>> Thanks for your comments.
>>
>> On 23 Jun 2015, at 12:45, Jean Tremblay <jean.tremb...@zen-innovations.com> wrote:
>>
>> Hi,
>>
>> I have a cluster with 5 nodes running Cassandra 2.1.6.
>>
>> I had to reboot a node. I killed the Cassandra process on that node, rebooted the machine, and restarted Cassandra.
>>
>> ~/apache-cassandra-DATA/data:321> nodetool status
>> Datacenter: datacenter1
>> =======================
>> Status=Up/Down
>> |/ State=Normal/Leaving/Joining/Moving
>> --  Address        Load       Tokens  Owns  Host ID                               Rack
>> UN  192.168.2.104  158.27 GB  256     ?     6479205e-6a19-49a8-b1a1-7e788ec29caa  rack1
>> UN  192.168.2.100  4.75 GB    256     ?     e821da50-23c6-4ea0-b3a1-275ded63bc1f  rack1
>> UN  192.168.2.101  157.43 GB  256     ?     01525665-bacc-4207-a8c3-eb4fd9532401  rack1
>> UN  192.168.2.102  159.27 GB  256     ?     596a33d7-5089-4c7e-a9ad-e1f22111b160  rack1
>> UN  192.168.2.103  167 GB     256     ?     0ce1d48e-57a9-4615-8e12-d7ef3d621c7d  rack1
>>
>> After restarting node 192.168.2.100 I noticed that its load had diminished to 2% of what it was.
>> And now when I query the cluster, my queries are bombing and that node times out with an error:
>>
>> WARN  [MessagingService-Incoming-/192.168.2.102] 2015-06-23 12:26:00,056 IncomingTcpConnection.java:97 - UnknownColumnFamilyException reading from socket; closing
>> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=ddc346b0-1372-11e5-9ba1-195596ed1fd9
>>     at org.apache.cassandra.db.ColumnFamilySerializer.deserializeCfId(ColumnFamilySerializer.java:164) ~[apache-cassandra-2.1.6.jar:2.1.6]
>>     at org.apache.cassandra.db.ColumnFamilySerializer.deserialize(ColumnFamilySerializer.java:97) ~[apache-cassandra-2.1.6.jar:2.1.6]
>>     at org.apache.cassandra.db.Mutation$MutationSerializer.deserializeOneCf(Mutation.java:322) ~[apache-cassandra-2.1.6.jar:2.1.6]
>>     at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:302) ~[apache-cassandra-2.1.6.jar:2.1.6]
>>     at org.apache.cassandra.db.Mutation$MutationSerializer.deserialize(Mutation.java:330) ~[apache-cassandra-2.1.6.jar:2.1.6]
>>
>> What is going on? Has anyone experienced something like this?
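The kind of suspicious load drop reported in this thread (one node at ~5 GB while its peers hold ~160 GB) can be caught mechanically from `nodetool status` output. A sketch under stated assumptions: the `flag_low_load` helper is hypothetical, and the hard-coded sample lines stand in for a live cluster, mirroring the output pasted above:

```shell
# Flag nodes whose reported load is under 10% of the largest load seen,
# as happened to 192.168.2.100 (4.75 GB vs ~160 GB peers).
flag_low_load() {
  # Reads nodetool-status-like lines on stdin: "UN <addr> <load> GB ..."
  # Prints the addresses of suspiciously light nodes.
  awk '$1 ~ /^[UD][NLJM]$/ && $4 == "GB" {
         load[$2] = $3
         if ($3 + 0 > max) max = $3 + 0
       }
       END { for (ip in load) if (load[ip] + 0 < 0.1 * max) print ip }'
}

printf '%s\n' \
  'UN 192.168.2.104 158.27 GB 256 ? 6479205e rack1' \
  'UN 192.168.2.100 4.75 GB 256 ? e821da50 rack1' \
  'UN 192.168.2.101 157.43 GB 256 ? 01525665 rack1' | flag_low_load
# prints: 192.168.2.100
```

In production this would be fed from `nodetool status` directly rather than a hard-coded sample; a node it flags right after a restart is a candidate for re-bootstrapping or repair rather than for serving traffic.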