In the system logs on the node where the repair was initiated, I can see that the node has requested Merkle trees from all nodes, including itself.
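The lines below were pulled by grepping each node's system log for the repair session ID (a rough sketch of the command, assuming Cassandra's default packaged log location; adjust the path for your install):

    # Run on every node involved in the repair, then compare timestamps across
    # nodes to reconstruct the session in chronological order.
    # Assumes the default log path /var/log/cassandra/system.log.
    grep '6e3385e0-74d1-11ec-8e66-9f084ace9968' /var/log/cassandra/system.log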
INFO [Repair#3:1] 2022-01-14 03:32:18,805 RepairJob.java:172 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Requesting merkle trees for tablename (to [/xyz.abc.def.14, /xyz.abc.def.13, /xyz.abc.def.12, /xyz.mkn.pq.18, /xyz.mkn.pq.16, /xyz.mkn.pq.17])
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,841 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.17
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,847 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.16
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,851 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.18
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,856 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.abc.def.14
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,876 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.abc.def.12

As per the logs, a Merkle tree was never received from the node with IP xyz.abc.def.13. In the system logs of the node with IP xyz.abc.def.13, I can see the following:

INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,850 Validator.java:281 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Sending completed merkle tree to /xyz.mkn.pq.17 for keyspace.tablename

From the above I inferred that the repair session has become orphaned: it is waiting for a Merkle tree from a node, and it is never going to receive it because the message was lost somewhere in the network in between.

Regards
Manish

On Tue, Jan 18, 2022 at 4:39 PM Bowen Song <bo...@bso.ng> wrote:

> The entry in the debug.log is not specific to a repair session, and it
> could also be caused by reasons other than a network connectivity issue,
> such as long STW GC pauses. I usually don't start troubleshooting an issue
> from the debug log, as it can be rather noisy. The system.log is a better
> starting point.
>
> If I were to troubleshoot the issue, I would start from the system logs on
> the node that initiated the repair, i.e. the node you ran the "nodetool
> repair" command on. Follow the repair ID (a UUID) in the logs on all nodes
> involved in the repair and read all related logs in chronological order to
> find out what exactly happened.
>
> BTW, if the issue is easily reproducible, I would re-run the repair with a
> reduced scope (such as a single table and token range) to get fewer logs
> related to the repair session. Fewer logs means less time spent reading
> and analysing them.
>
> Hope this helps.
>
> On 18/01/2022 10:03, manish khandelwal wrote:
>
> I have a Cassandra 3.11.2 cluster with two DCs. While running repair, I
> am observing the following behavior.
>
> I see that a node is not able to receive Merkle trees from one or two
> nodes. I can also see that the missing nodes did send their Merkle trees,
> but they were not received. This makes the repair hang on a consistent
> basis. In netstats I can see the following output:
>
> Mode: NORMAL
> Not sending any streams.
> Attempted: 7858888
> Mismatch (Blocking): 2560
> Mismatch (Background): 17173
> Pool Name                    Active   Pending      Completed   Dropped
> Large messages                  n/a         0           6313         3
> Small messages                  n/a         0       55978004         3
> Gossip messages                 n/a         0          93756       125
>
> Does it represent network issues?
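> A rough way to see whether these drops are ongoing (a sketch, assuming the
> standard "nodetool netstats" output format) is to sample the counters
> periodically while the repair is running:
>
>     # Sample the message pool counters once a minute; steadily increasing
>     # values in the Dropped column would suggest ongoing message loss.
>     while true; do
>         date
>         nodetool netstats | grep -E 'Pool Name|messages'
>         sleep 60
>     done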
> In the debug logs I saw the following:
>
> DEBUG [MessagingService-Outgoing-hostname/xxx.yy.zz.kk-Large] 2022-01-14 05:00:19,031 OutboundTcpConnection.java:349 - Error writing to hostname/xxx.yy.zz.kk
> java.io.IOException: Connection timed out
>         at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_221]
>         at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_221]
>         at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_221]
>         at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_221]
>         at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_221]
>         at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[na:1.8.0_221]
>         at java.nio.channels.Channels.writeFully(Channels.java:98) ~[na:1.8.0_221]
>         at java.nio.channels.Channels.access$000(Channels.java:61) ~[na:1.8.0_221]
>         at java.nio.channels.Channels$1.write(Channels.java:174) ~[na:1.8.0_221]
>         at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) ~[lz4-1.3.0.jar:na]
>         at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) ~[lz4-1.3.0.jar:na]
>
> Does this show any network fluctuations?
>
> Regards
> Manish