Hi,

I am running a repair in production, starting with one of the 6 nodes in the
cluster (3 nodes in each of two DCs). The Cassandra version is 3.0.14.

I ran: repair -pr --full keyspace on node 1 (1 TB of data). It ran for two
days and then crashed.

The error shows:
3202]] finished (progress: 3%)
Exception occurred during clean-up.
java.lang.reflect.UndeclaredThrowableException
Cassandra has shutdown.
error: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
-- StackTrace --
java.io.IOException: [2019-07-31 20:19:20,797] JMX connection closed. You should check server log for repair status of keyspace keyspace_masked (Subsequent keyspaces are not going to be repaired).
        at org.apache.cassandra.tools.RepairRunner.handleConnectionFailed(RepairRunner.java:97)
        at org.apache.cassandra.tools.RepairRunner.handleConnectionClosed(RepairRunner.java:91)
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:90)
        at javax.management.NotificationBroadcasterSupport.handleNotification(NotificationBroadcasterSupport.java:275)
        at javax.management.NotificationBroadcasterSupport$SendNotifJob.run(NotificationBroadcasterSupport.java:352)
        at javax.management.NotificationBroadcasterSupport$1.execute(NotificationBroadcasterSupport.java:337)
        at javax.management.NotificationBroadcasterSupport.sendNotification(NotificationBroadcasterSupport.java:248)
        at javax.management.remote.rmi.RMIConnector.sendNotification(RMIConnector.java:441)
        at javax.management.remote.rmi.RMIConnector.close(RMIConnector.java:533)
        at javax.management.remote.rmi.RMIConnector.access$1300(RMIConnector.java:121)
        at javax.management.remote.rmi.RMIConnector$RMIClientCommunicatorAdmin.gotIOException(RMIConnector.java:1534)
        at javax.management.remote.rmi.RMIConnector$RMINotifClient.fetchNotifs(RMIConnector.java:1352)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchOneNotif(ClientNotifForwarder.java:655)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.fetchNotifs(ClientNotifForwarder.java:607)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(ClientNotifForwarder.java:471)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(ClientNotifForwarder.java:452)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(ClientNotifForwarder.java:108)

system.log shows:
INFO  [Service Thread] 2019-07-31 20:19:08,579 GCInspector.java:284 - G1 Young Generation GC in 2915ms.  G1 Eden Space: 914358272 -> 0; G1 Old Gen: 19043999248 -> 20219035248;
INFO  [Service Thread] 2019-07-31 20:19:08,579 StatusLogger.java:52 - Pool Name                    Active   Pending      Completed   Blocked  All Time Blocked
INFO  [Service Thread] 2019-07-31 20:19:08,584 StatusLogger.java:56 - MutationStage                    19        15     9578177305         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ViewMutationStage                 0         0              0         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadStage                        10         0      219357504         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - RequestResponseStage              1         0      625174550         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - ReadRepairStage                   0         0        2544772         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - CounterMutationStage              0         0              0         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,585 StatusLogger.java:56 - MiscStage                         0         0              0         0                 0
INFO  [Service Thread] 2019-07-31 20:19:08,586 StatusLogger.java:56 - CompactionExecutor                1         1        9515493         0                 0


When I restarted Cassandra, it failed to start.
The error in system.log now shows:

INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:575 - Cassandra version: 3.0.14
INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:576 - Thrift API version: 20.1.0
INFO  [main] 2019-07-31 21:35:02,044 StorageService.java:577 - CQL supported versions: 3.4.0 (default: 3.4.0)
ERROR [main] 2019-07-31 21:35:02,075 CassandraDaemon.java:710 - Exception encountered during startup
org.apache.cassandra.io.FSReadError: java.io.EOFException
        at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:142) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[na:1.8.0_171]
        at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[na:1.8.0_171]
        at java.util.Iterator.forEachRemaining(Iterator.java:116) ~[na:1.8.0_171]
        at java.util.Spliterators$IteratorSpliterator.forEachRemaining(Spliterators.java:1801) ~[na:1.8.0_171]
        at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[na:1.8.0_171]
        at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[na:1.8.0_171]
        at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708) ~[na:1.8.0_171]
        at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[na:1.8.0_171]
        at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499) ~[na:1.8.0_171]
        at org.apache.cassandra.hints.HintsCatalog.load(HintsCatalog.java:65) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.hints.HintsService.<init>(HintsService.java:88) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.hints.HintsService.<clinit>(HintsService.java:63) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.StorageProxy.<clinit>(StorageProxy.java:121) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at java.lang.Class.forName0(Native Method) ~[na:1.8.0_171]
        at java.lang.Class.forName(Class.java:264) ~[na:1.8.0_171]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:585) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.StorageService.initServer(StorageService.java:570) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:346) [apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:569) [apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:697) [apache-cassandra-3.0.14.jar:3.0.14]
Caused by: java.io.EOFException: null
        at java.io.RandomAccessFile.readInt(RandomAccessFile.java:803) ~[na:1.8.0_171]
        at org.apache.cassandra.hints.HintsDescriptor.deserialize(HintsDescriptor.java:237) ~[apache-cassandra-3.0.14.jar:3.0.14]
        at org.apache.cassandra.hints.HintsDescriptor.readFromFile(HintsDescriptor.java:138) ~[apache-cassandra-3.0.14.jar:3.0.14]
        ... 20 common frames omitted


Can anyone help me bring the node back up?

Also, anti-compactions (after repair) are running on the other nodes. Should
I stop them as well? If so, how do I do that (nodetool stop compaction?)?

Any suggestions will be much appreciated.

Thanks
Regards
Martin
