Hi,

I am running repair on a production cluster of 6 nodes (3 nodes in each of two DCs), starting with one node. Cassandra version is 3.0.14.

I ran nodetool repair -pr --full keyspace on node 1, which holds 1TB of data. It ran for two days and then crashed. The error shows:

3202]] finished (progress: 3%)
Exception occurred during clean-up. java.lang.reflect.UndeclaredThrowableException
Cassandra has shutdown.
error: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
-- StackTrace --
java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
    at org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
    at org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
    at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
    at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
    at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
    at javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
    at javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
    at javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
    at javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
    at javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
    at javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
    at javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
    at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

system.log shows:

INFO [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1 Young Generation GC in 2915ms. G1 Eden Space: 914358272 -> 0; G1 Old Gen: 19043999248 -> 20219035248;
INFO [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 - Pool Name Active Pending Completed Blocked All Time Blocked
INFO [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 - MutationStage 19 15 9578177305 0 0
INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ViewMutationStage 0 0 0 0 0
INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadStage 10 0 219357504 0 0
INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - RequestResponseStage 1 0 625174550 0 0
INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadRepairStage 0 0 2544772 0 0
INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - CounterMutationStage 0 0 0 0 0
INFO [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - MiscStage 0 0 0 0 0
INFO [Service Thread] 2019-07-31 20:19:08,586 StatusLogger.java:56 - CompactionExecutor 1 1 9515493 0 0

When I restart Cassandra, it fails to start, and system.log now shows:

INFO [main] 2019-07-31 21:35:02,044 StorageService.java:575 - Cassandra version: 3.0.14
INFO [main] 2019-07-31 21:35:02,044 StorageService.java:576 - Thrift API version: 20.1.0
INFO [main] 2019-07-31 21:35:02,044 StorageService.java:577 - CQL supported versions: 3.4.0 (default: 3.4.0)
ERROR [main] 2019-07-31 21:35:02,075 CassandraDaemon.java:710 - Exception encountered during startup
org.apache.cassandra.io.FSReadError: java.io.EOFException
    at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142) ~[apache-cassandra-3.0.14.jar:3.0.14]
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_171]
    at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_171]
    at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_171]
    at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_171]
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_171]
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_171]
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_171]
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_171]
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_171]
    at org.apache.cassandra.hints.HintsCatalog.load(HintsCatalog.java:65) ~[apache-cassandra-3.0.14.jar:3.0.14]
    at org.apache.cassandra.hints.HintsService.<init>(HintsService.java:88) ~[apache-cassandra-3.0.14.jar:3.0.14]
    at org.apache.cassandra.hints.HintsService.<clinit>(HintsService.java:63) ~[apache-cassandra-3.0.14.jar:3.0.14]
    at org.apache.cassandra.service.StorageProxy.<clinit>(StorageProxy.java:121) ~[apache-cassandra-3.0.14.jar:3.0.14]
    at java.lang.Class.forName0(Native Method) ~[na:1.8.0_171]
    at java.lang.Class.forName(Class.java:264) ~[na:1.8.0_171]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-3.0.14.jar:3.0.14]
    at org.apache.cassandra.service.StorageService.initServer(StorageService.java:570) ~[apache-cassandra-3.0.14.jar:3.0.14]
    at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346) [apache-cassandra-3.0.14.jar:3.0.14]
    at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [apache-cassandra-3.0.14.jar:3.0.14]
    at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:697) [apache-cassandra-3.0.14.jar:3.0.14]
Caused by: java.io.EOFException: null
    at java.io.RandomAccessFile.readInt(RandomAccessFile.java:803) ~[na:1.8.0_171]
    at org.apache.cassandra.hints.HintsDescriptor.deserialize(HintsDescriptor.java:237) ~[apache-cassandra-3.0.14.jar:3.0.14]
    at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:138) ~[apache-cassandra-3.0.14.jar:3.0.14]
    ... 20 common frames omitted

Can anyone help me bring the node back up?

Also, anti-compactions (after repair) are running on the other nodes. Should I stop them as well, and if so, how (nodetool stop compaction?)?

Any suggestions would be much appreciated.

Thanks,
Regards,
Martin