The entry in the debug.log is not specific to a repair session, and it
could also be caused by things other than network connectivity issues,
such as long stop-the-world (STW) GC pauses. I usually don't start
troubleshooting an issue from the debug log, as it can be rather noisy.
The system.log is a better starting point.
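To rule long GC pauses in or out, one option is to scan system.log for GCInspector entries. The snippet below is only a sketch: the regular expression assumes GCInspector lines of the form "<timestamp> GCInspector.java:<line> - <collector> GC in <N>ms", and the `long_gc_pauses` helper and its 1000 ms default threshold are illustrative choices, not anything Cassandra ships.

```python
import re

# Assumed shape of a GCInspector entry in system.log, for example:
#   "... 2022-01-14 05:00:19,031 GCInspector.java:<line> - <collector> GC in <N>ms. ..."
GC_LINE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) GCInspector\.java:\d+ - "
    r"(?P<collector>.+?) GC in (?P<ms>\d+)ms"
)


def long_gc_pauses(lines, threshold_ms=1000):
    """Return (timestamp, collector, pause_ms) for each GC pause >= threshold_ms.

    threshold_ms is an arbitrary illustrative default; pick a value that
    makes sense relative to your request and streaming timeouts.
    """
    hits = []
    for line in lines:
        match = GC_LINE.search(line)
        if match and int(match.group("ms")) >= threshold_ms:
            hits.append(
                (match.group("ts"), match.group("collector"), int(match.group("ms")))
            )
    return hits
```

Passing an open system.log file object as `lines` and comparing the returned timestamps with the time of the "Connection timed out" entry should show whether a pause lines up with it.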
If I were to troubleshoot the issue, I would start from the system logs
on the node that initiated the repair, i.e. the node you ran the
"nodetool repair" command on. Follow the repair ID (a UUID) in the logs
on all nodes involved in the repair and read all related logs in
chronological order to find out exactly what happened.
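Following a repair ID across nodes can be scripted. The sketch below merges the lines that mention a given repair ID from several nodes' logs into one timeline. It assumes the timestamp format seen in the log entry quoted further down ("2022-01-14 05:00:19,031") and that the nodes' clocks are reasonably in sync; `repair_timeline` and the node labels are hypothetical names used for illustration.

```python
import re
from datetime import datetime

# Timestamp shape taken from the log entry quoted in this thread.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}")


def repair_timeline(logs_by_node, repair_id):
    """Merge lines mentioning repair_id from all nodes into chronological order.

    logs_by_node maps a node label (hypothetical, e.g. "node1") to an
    iterable of log lines. Returns (timestamp, node, line) tuples.
    """
    events = []
    for node, lines in logs_by_node.items():
        for line in lines:
            if repair_id not in line:
                continue
            match = TIMESTAMP.search(line)
            if match is None:
                continue  # e.g. stack-trace continuation lines have no timestamp
            ts = datetime.strptime(match.group(0), "%Y-%m-%d %H:%M:%S,%f")
            events.append((ts, node, line.rstrip("\n")))
    # Sorting is stable, so lines with equal timestamps keep their input order.
    events.sort(key=lambda event: event[0])
    return events
```

Each value in `logs_by_node` can simply be an open system.log file copied from the corresponding node. Lines without the repair ID (such as stack traces that follow a matching line) are skipped, so the timeline is a pointer to where to read in the full logs, not a replacement for them.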
BTW, if the issue is easily reproducible, I would re-run the repair with
a reduced scope (such as a single table and token range) to get fewer
logs related to the repair session. Fewer logs means less time spent
reading and analysing them.
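As an illustration of a reduced-scope run, the sketch below only builds (and prints) a "nodetool repair" command line limited to one table and one token range, using nodetool's -st / -et (start/end token) options. The keyspace name, table name and token values are placeholders, and `build_repair_command` is a hypothetical helper; nothing is executed.

```python
import shlex


def build_repair_command(keyspace, table, start_token, end_token):
    """Build a nodetool command that repairs one table over one token range."""
    return [
        "nodetool", "repair",
        "-st", str(start_token),  # start of the token range
        "-et", str(end_token),    # end of the token range
        "--",                     # separates options from the keyspace/table arguments
        keyspace, table,
    ]


# Placeholder keyspace, table and tokens: substitute values from your own cluster.
cmd = build_repair_command(
    "my_keyspace", "my_table", -9223372036854775808, -9100000000000000000
)
print(shlex.join(cmd))  # shlex.join needs Python 3.8+
```

A small token range on a single table keeps the number of merkle tree requests and responses low, which makes the repair session much easier to follow in the logs.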
Hope this helps.
On 18/01/2022 10:03, manish khandelwal wrote:
I have a Cassandra 3.11.2 cluster with two DCs. While running repair,
I am observing the following behavior.
I am seeing that a node is not able to receive the merkle tree from one
or two nodes. I can also see that the missing nodes did send the
merkle tree, but it was not received. This makes repair hang on a
consistent basis. In netstats I can see the following output:
Mode: NORMAL
Not sending any streams. Attempted: 7858888
Mismatch (Blocking): 2560
Mismatch (Background): 17173
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0           6313         3
Small messages                  n/a         0       55978004         3
Gossip messages                 n/a         0          93756       125
Does it represent network issues? In Debug logs I saw something:
DEBUG [MessagingService-Outgoing-hostname/xxx.yy.zz.kk-Large] 2022-01-14 05:00:19,031 OutboundTcpConnection.java:349 - Error writing to hostname/xxx.yy.zz.kk
java.io.IOException: Connection timed out
    at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_221]
    at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_221]
    at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_221]
    at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_221]
    at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_221]
    at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[na:1.8.0_221]
    at java.nio.channels.Channels.writeFully(Channels.java:98) ~[na:1.8.0_221]
    at java.nio.channels.Channels.access$000(Channels.java:61) ~[na:1.8.0_221]
    at java.nio.channels.Channels$1.write(Channels.java:174) ~[na:1.8.0_221]
    at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) ~[lz4-1.3.0.jar:na]
    at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) ~[lz4-1.3.0.jar:na]
Does this show any network fluctuations?
Regards
Manish