Hello,

Today our weekly repair job failed for the second time; it had been working for many months without a problem. We have multiple Cassandra nodes spread across two data centers.

The repair is started on only one node, with the following command:

    nodetool repair -full -dcpar

Is it problematic that the repair is started on only one node?
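For context, the weekly job is essentially a small cron-driven wrapper around that single command, run on one coordinator node. The sketch below is illustrative only, not our exact script; the log path and error handling are placeholders:

    #!/usr/bin/env bash
    # Illustrative weekly repair wrapper (placeholder paths, not the real script).
    # Runs a full, DC-parallel repair from a single coordinator node and keeps
    # the nodetool output for later inspection.
    set -euo pipefail

    LOG_DIR=/var/log/cassandra-repair            # placeholder path
    LOG_FILE="$LOG_DIR/repair-$(date +%F).log"
    mkdir -p "$LOG_DIR"

    # -full  : full (non-incremental) repair
    # -dcpar : repair the data centers in parallel
    if ! nodetool repair -full -dcpar >"$LOG_FILE" 2>&1; then
        echo "weekly repair failed, see $LOG_FILE" >&2
        exit 1
    fi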
The repair fails after about one hour with the following error message:

    failed with error Could not create snapshot at /192.168.13.232 (progress: 0%)
    [2019-12-28 05:00:04,295] Some repair failed
    [2019-12-28 05:00:04,296] Repair command #1 finished in 1 hour 0 minutes 2 seconds
    error: Repair job has failed with the error message: [2019-12-28 05:00:04,295] Some repair failed
    -- StackTrace --
    java.lang.RuntimeException: Repair job has failed with the error message: [2019-12-28 05:00:04,295] Some repair failed
        at org.apache.cassandra.tools.RepairRunner.progress(RepairRunner.java:116)
        at org.apache.cassandra.utils.progress.jmx.JMXNotificationProgressListener.handleNotification(JMXNotificationProgressListener.java:77)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.dispatchNotification(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.doRun(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$NotifFetcher.run(Unknown Source)
        at com.sun.jmx.remote.internal.ClientNotifForwarder$LinearExecutor$1.run(Unknown Source)

On 192.168.13.232, which is in the second data center, the only related entries I could find were in debug.log:

    DEBUG [COMMIT-LOG-ALLOCATOR] 2019-12-28 04:21:20,143 AbstractCommitLogSegmentManager.java:109 - No segments in reserve; creating a fresh one
    DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28 04:31:00,450 OutboundTcpConnection.java:410 - Socket to 192.168.13.120 closed
    DEBUG [MessagingService-Outgoing-192.168.13.120-Small] 2019-12-28 04:31:00,450 OutboundTcpConnection.java:349 - Error writing to 192.168.13.120
    java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_111]

We tried to run the repair a few more times, but it always failed with the same error. After restarting all nodes, it was finally successful.

Any idea what could be wrong?

Regards,
Oliver