Hello,

Today our weekly repair job failed for the second time, after running for
many months without a problem. We have multiple Cassandra nodes in two
data centers.

The repair command is started only on one node with the following
parameters:

nodetool repair -full -dcpar

Is it problematic if the repair is started only on one node?

The repair fails after one hour with the following error message:

 failed with error Could not create snapshot at /192.168.13.232 (progress: 0%)
[2019-12-28 05:00:04,295] Some repair failed
[2019-12-28 05:00:04,296] Repair command #1 finished in 1 hour 0 minutes 2 seconds
error: Repair job has failed with the error message: [2019-12-28 05:00:04,295] Some repair failed
-- StackTrace --
java.lang.RuntimeException: Repair job has failed with the error message: [2019-12-28 05:00:04,295] Some repair failed
        at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(Unknown Source)
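
For anyone who wants to check the right node's logs automatically, here is a
minimal sketch (not something from our setup) of how the endpoint named in the
"Could not create snapshot at" message could be pulled out of captured nodetool
output. The function name failed_snapshot_endpoints and the regular expression
are assumptions based only on the output shown above; other Cassandra versions
may word the message differently.

```python
import re
from typing import List

# Assumed message format, taken from the nodetool output quoted above.
SNAPSHOT_ERROR = re.compile(
    r"Could not create snapshot at /(\d{1,3}(?:\.\d{1,3}){3})"
)


def failed_snapshot_endpoints(output: str) -> List[str]:
    """Return the distinct endpoint IPs named in snapshot failure messages."""
    # Output pasted from a terminal or mail may be hard-wrapped, so
    # collapse all whitespace runs into single spaces before matching.
    flat = " ".join(output.split())
    endpoints: List[str] = []
    for ip in SNAPSHOT_ERROR.findall(flat):
        if ip not in endpoints:
            endpoints.append(ip)
    return endpoints


sample = """ failed with error Could not create snapshot at /192.168.13.232 (progress:
0%)
[2019-12-28 05:00:04,295] Some repair failed"""

print(failed_snapshot_endpoints(sample))
```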

On 192.168.13.232, which is in the second data center, I could find the
following messages only in debug.log:
DEBUG [COMMIT-LOG-ALLOCATOR] 2019-12-28 04:21:20,143 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28 04:31:00,450 OutboundTcpConnection.java:410 - Socket to 192.168.13.120 closed
DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28 04:31:00,450 OutboundTcpConnection.java:349 - Error writing to 192.168.13.120
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_111]

We tried running the repair a few more times, but it always failed with the
same error. After restarting all nodes, it finally succeeded.

Any idea what could be wrong?

Regards
Oliver
