Hi Alex,

Thanks, much appreciated.

Regards
Martin

On Thu, Aug 1, 2019 at 3:34 PM Alexander Dejanovski <a...@thelastpickle.com> wrote:

> Hi Martin,
>
> apparently this is the bug you've been hit by on hints :
> https://issues.apache.org/jira/browse/CASSANDRA-14080
> It was fixed in 3.0.17.
>
> You didn't provide the logs from Cassandra at the time of the crash, only
> the output of nodetool, so it's hard to say what caused it. You may be hit
> by this bug: https://issues.apache.org/jira/browse/CASSANDRA-14096
> This is unlikely to happen with Reaper (as mentioned in the description of
> the ticket) since it will generate smaller Merkle trees as subrange covers
> less partitions for each repair session.
>
> So the advice is : upgrade to 3.0.19 (even 3.11.4 IMHO as 3.0 offers less
> performance than 3.11) and use Reaper <http://cassandra-reaper.io/> to
> handle/schedule repairs.
>
> Cheers,
>
> -----------------
> Alexander Dejanovski
> France
> @alexanderdeja
>
> Consultant
> Apache Cassandra Consulting
> http://www.thelastpickle.com
>
>
> On Thu, Aug 1, 2019 at 12:05 AM Martin Xue <martin...@gmail.com> wrote:
>
>> Hi Alex,
>>
>> Thanks for your reply. The disk space was around 80%. The crash happened
>> during repair, primary range full repair on 1TB keyspace.
>>
>> Would that crash again?
>>
>> Thanks
>> Regards
>> Martin
>>
>> On Thu., 1 Aug. 2019, 12:04 am Alexander Dejanovski, <a...@thelastpickle.com> wrote:
>>
>>> It looks like you have a corrupted hint file.
>>> Did the node run out of disk space while repair was running?
>>>
>>> You might want to move the hint files off their current directory and
>>> try to restart the node again.
>>> Since you'll have lost mutations then, you'll need... to run repair
>>> ¯\_(ツ)_/¯
>>>
>>> -----------------
>>> Alexander Dejanovski
>>> France
>>> @alexanderdeja
>>>
>>> Consultant
>>> Apache Cassandra Consulting
>>> http://www.thelastpickle.com
>>>
>>>
>>> On Wed, Jul 31, 2019 at 3:51 PM Martin Xue <martin...@gmail.com> wrote:
>>>
>>>> Hi,
>>>>
>>>> I am running repair on production, started with one of 6 nodes in the
>>>> cluster (3 nodes in each of two DC). Cassandra version 3.0.14.
>>>>
>>>> running: repair -pr --full keyspace on node 1, 1TB data, takes two
>>>> days, and crash,
>>>>
>>>> error shows:
>>>> 3202]] finished (progress: 3%)
>>>> Exception occurred during clean-up.
>>>> java.lang.reflect.UndeclaredThrowableException
>>>> Cassandra has shutdown.
>>>> error: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
>>>> -- StackTrace --
>>>> java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked keyspaces are not going to be repaired).
>>>>     at org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
>>>>     at org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
>>>>     at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
>>>>     at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
>>>>     at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
>>>>     at javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
>>>>     at javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
>>>>     at javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
>>>>     at javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
>>>>     at javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
>>>>     at javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
>>>>     at javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
>>>>     at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
>>>>     at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
>>>>     at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
>>>>     at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
>>>>     at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)
>>>>
>>>> system.log shows
>>>> INFO [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1 Young Generation GC in 2915ms. G1 Eden Space: 914358272 -> 0; G1 Old Gen: 19043999248 -> 20219035248;
>>>> INFO [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 - Pool Name               Active   Pending    Completed   Blocked   All Time Blocked
>>>> INFO [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 - MutationStage               19        15   9578177305         0                  0
>>>> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ViewMutationStage            0         0            0         0                  0
>>>> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadStage                   10         0    219357504         0                  0
>>>> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - RequestResponseStage         1         0    625174550         0                  0
>>>> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadRepairStage              0         0      2544772         0                  0
>>>> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - CounterMutationStage         0         0            0         0                  0
>>>> INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - MiscStage                    0         0            0         0                  0
>>>> INFO [Service Thread] 2019-07-31 20:19:08,586 StatusLogger.java:56 - CompactionExecutor           1         1      9515493         0                  0
>>>>
>>>>
>>>> When I restart the cassandra, it still failed,
>>>> now the error in system.log shows:
>>>>
>>>> INFO [main] 2019-07-31 21:35:02,044 StorageService.java:575 - Cassandra version: 3.0.14
>>>> INFO [main] 2019-07-31 21:35:02,044 StorageService.java:576 - Thrift API version: 20.1.0
>>>> INFO [main] 2019-07-31 21:35:02,044 StorageService.java:577 - CQL supported versions: 3.4.0 (default: 3.4.0)
>>>> ERROR [main] 2019-07-31 21:35:02,075 CassandraDaemon.java:710 - Exception encountered during startup
>>>> org.apache.cassandra.io.FSReadError: java.io.EOFException
>>>>     at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_171]
>>>>     at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_171]
>>>>     at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_171]
>>>>     at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_171]
>>>>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_171]
>>>>     at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_171]
>>>>     at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_171]
>>>>     at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_171]
>>>>     at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_171]
>>>>     at org.apache.cassandra.hints.HintsCatalog.load(HintsCatalog.java:65) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at org.apache.cassandra.hints.HintsService.<init>(HintsService.java:88) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at org.apache.cassandra.hints.HintsService.<clinit>(HintsService.java:63) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at org.apache.cassandra.service.StorageProxy.<clinit>(StorageProxy.java:121) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at java.lang.Class.forName0(Native Method) ~[na:1.8.0_171]
>>>>     at java.lang.Class.forName(Class.java:264) ~[na:1.8.0_171]
>>>>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at org.apache.cassandra.service.StorageService.initServer(StorageService.java:570) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346) [apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:697) [apache-cassandra-3.0.14.jar:3.0.14]
>>>> Caused by: java.io.EOFException: null
>>>>     at java.io.RandomAccessFile.readInt(RandomAccessFile.java:803) ~[na:1.8.0_171]
>>>>     at org.apache.cassandra.hints.HintsDescriptor.deserialize(HintsDescriptor.java:237) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:138) ~[apache-cassandra-3.0.14.jar:3.0.14]
>>>>     ... 20 common frames omitted
>>>>
>>>>
>>>> Can anyone help how to bring back the node again?
>>>>
>>>> Also there are (anti-compaction after repair) running on other nodes,
>>>> shall I stopped them as well, if so how to do it (nodetool stop
>>>> compaction?)?
>>>>
>>>> Any suggestions will be much appreciated.
>>>>
>>>> Thanks
>>>> Regards
>>>> Martin
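
For anyone finding this thread later: the workaround suggested above (move the hint files off their current directory, restart the node, then run repair) could be scripted roughly as below. This is only a sketch under assumptions that are not from the thread: the function name `move_hint_files` is made up for illustration, the `.hints` and `.crc32` file extensions are assumed from how 3.0 hint files are usually named, and the directories are whatever you pass in (the real hints location is the `hints_directory` setting in cassandra.yaml). It should only be run while Cassandra is stopped on that node.

```python
import shutil
from pathlib import Path

# Assumed extensions of hint files and their checksum companions.
HINT_SUFFIXES = {".hints", ".crc32"}


def move_hint_files(hints_dir, backup_dir):
    """Move hint files from hints_dir into backup_dir.

    Returns the sorted list of file names that were moved. Files with
    other extensions and subdirectories are left untouched.
    """
    src = Path(hints_dir)
    dst = Path(backup_dir)
    # Create the backup directory if it does not exist yet.
    dst.mkdir(parents=True, exist_ok=True)

    moved = []
    for path in sorted(src.iterdir()):
        if not path.is_file() or path.suffix not in HINT_SUFFIXES:
            continue
        # Keep the files rather than deleting them, so they can be inspected later.
        shutil.move(str(path), str(dst / path.name))
        moved.append(path.name)
    return moved
```

Usage would be something like `move_hint_files("/path/to/hints", "/path/to/hints_backup")` with both paths replaced by your own. The mutations held in the moved hints are not replayed, which is why a repair is needed afterwards, as noted in the reply above.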