In the system logs on the node where the repair was initiated, I can see that the node has requested Merkle trees from all nodes, including itself.
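The lines below were pulled by grepping each node's system log for the repair session ID (a rough sketch of the command, assuming Cassandra's default packaged log location; adjust the path for your install):

    # Run on every node involved in the repair, then compare timestamps across
    # nodes to reconstruct the session in chronological order.
    # Assumes the default log path /var/log/cassandra/system.log.
    grep '6e3385e0-74d1-11ec-8e66-9f084ace9968' /var/log/cassandra/system.log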
INFO [Repair#3:1] 2022-01-14 03:32:18,805 RepairJob.java:172 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Requesting merkle trees for tablename (to [/xyz.abc.def.14, /xyz.abc.def.13, /xyz.abc.def.12, /xyz.mkn.pq.18, /xyz.mkn.pq.16, /xyz.mkn.pq.17])
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,841 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.17
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,847 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.16
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,851 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.18
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,856 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.abc.def.14
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,876 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.abc.def.12

As per the logs, a Merkle tree was never received from the node with IP xyz.abc.def.13. In the system logs of the node with IP xyz.abc.def.13, I can see the following:

INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,850 Validator.java:281 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Sending completed merkle tree to /xyz.mkn.pq.17 for keyspace.tablename

From the above I inferred that the repair session has become orphaned: it is waiting for a Merkle tree from a node, and it is never going to receive it because the message was lost somewhere in the network in between.

Regards
Manish

On Tue, Jan 18, 2022 at 4:39 PM Bowen Song <bo...@bso.ng> wrote:

> The entry in the debug.log is not specific to a repair session, and it
> could also be caused by reasons other than a network connectivity issue,
> such as long STW GC pauses. I usually don't start troubleshooting an issue
> from the debug log, as it can be rather noisy. The system.log is a better
> starting point.
>
> If I were to troubleshoot the issue, I would start from the system logs on
> the node that initiated the repair, i.e. the node you ran the "nodetool
> repair" command on. Follow the repair ID (a UUID) in the logs on all nodes
> involved in the repair and read all related logs in chronological order to
> find out what exactly happened.
>
> BTW, if the issue is easily reproducible, I would re-run the repair with a
> reduced scope (such as a single table and token range) to get fewer logs
> related to the repair session. Fewer logs means less time spent reading
> and analysing them.
>
> Hope this helps.
>
> On 18/01/2022 10:03, manish khandelwal wrote:
>
> I have a Cassandra 3.11.2 cluster with two DCs. While running repair, I
> am observing the following behavior.
>
> I see that a node is not able to receive Merkle trees from one or two
> nodes. I can also see that the missing nodes did send their Merkle trees,
> but they were not received. This makes the repair hang on a consistent
> basis. In netstats I can see the following output:
>
> Mode: NORMAL
> Not sending any streams.
> Attempted: 7858888
> Mismatch (Blocking): 2560
> Mismatch (Background): 17173
> Pool Name                    Active   Pending      Completed   Dropped
> Large messages                  n/a         0           6313         3
> Small messages                  n/a         0       55978004         3
> Gossip messages                 n/a         0          93756       125
>
> Does it represent network issues?
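> A rough way to see whether these drops are ongoing (a sketch, assuming the
> standard "nodetool netstats" output format) is to sample the counters
> periodically while the repair is running:
>
>     # Sample the message pool counters once a minute; steadily increasing
>     # values in the Dropped column would suggest ongoing message loss.
>     while true; do
>         date
>         nodetool netstats | grep -E 'Pool Name|messages'
>         sleep 60
>     done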
> In the debug logs I saw the following:
>
> DEBUG [MessagingService-Outgoing-hostname/xxx.yy.zz.kk-Large] 2022-01-14 05:00:19,031 OutboundTcpConnection.java:349 - Error writing to hostname/xxx.yy.zz.kk
> java.io.IOException: Connection timed out
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_221]
>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_221]
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_221]
>         at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_221]
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_221]
>         at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[na:1.8.0_221]
>         at java.nio.channels.Channels.writeFully(Channels.java:98) ~[na:1.8.0_221]
>         at java.nio.channels.Channels.access$000(Channels.java:61) ~[na:1.8.0_221]
>         at java.nio.channels.Channels$1.write(Channels.java:174) ~[na:1.8.0_221]
>         at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) ~[lz4-1.3.0.jar:na]
>         at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) ~[lz4-1.3.0.jar:na]
>
> Does this show any network fluctuations?
>
> Regards
> Manish