The firewall's TCP aging (idle session timeout) value is 10 minutes, so with tcp_keepalive_time left at 7200 seconds the node's idle connections were being dropped and it appeared unresponsive. Is a 10-minute TCP aging value too low, or is it reasonable?
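For context, here is a minimal sketch of the kernel keepalive settings being discussed. Only tcp_keepalive_time = 60 comes from this thread; the interval and probe values shown are simply the usual Linux defaults, and the file path is an arbitrary example:

    # /etc/sysctl.d/99-tcp-keepalive.conf  (illustrative values, not a recommendation)
    # idle seconds before the first keepalive probe (kernel default is 7200)
    net.ipv4.tcp_keepalive_time = 60
    # seconds between unanswered probes (kernel default)
    net.ipv4.tcp_keepalive_intvl = 75
    # unanswered probes before the connection is declared dead (kernel default)
    net.ipv4.tcp_keepalive_probes = 9

    # Apply without a reboot:
    #   sudo sysctl -p /etc/sysctl.d/99-tcp-keepalive.conf

The point is simply that the keepalive interval must sit comfortably below the firewall's idle timeout; with a 10-minute aging value, the 7200-second default means connections look idle to the firewall long before the kernel ever probes them.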
On Mon, Jan 24, 2022 at 11:32 PM Bowen Song <bo...@bso.ng> wrote:

> Is reconfiguring your firewall an option? A stateful firewall really shouldn't remove a TCP connection in such a short time, unless the number of connections is very large and they are generally short lived (which is often seen in web servers).
>
> On 24/01/2022 13:03, manish khandelwal wrote:
>
> Hi All
>
> Thanks for the suggestions. The issue was that *tcp_keepalive_time* had the default value (7200 seconds). So once an idle connection was broken by the firewall, the application (the Cassandra node) was notified very late. Thus we were seeing one node sending the merkle tree and the other not receiving it. Reducing it to 60 solved the problem.
>
> Thanks again for the help.
>
> Regards
> Manish
>
> On Sat, Jan 22, 2022 at 12:25 PM C. Scott Andreas <sc...@paradoxica.net> wrote:
>
>> Hi Manish,
>>
>> I understand this answer is non-specific and might not be the most helpful, but figured I’d mention — Cassandra 3.11.2 is nearly four years old and a large number of bugs in repair and other subsystems have been resolved in the time since.
>>
>> I’d recommend upgrading to the latest release in the 3.11 series at minimum (3.11.11). You may find that the issue is resolved; or if not, you will be able to draw upon the community’s knowledge of a current release of the database.
>>
>> — Scott
>>
>> On Jan 21, 2022, at 8:51 PM, manish khandelwal <manishkhandelwa...@gmail.com> wrote:
>>
>> Hi All
>>
>> After going through the system.logs, I still see that sometimes the merkle tree is not received from remote DC nodes. Local DC nodes' trees are received as soon as they are sent, but in the case of the remote DC one or two nodes do not respond.
>>
>> There is a considerable time lag (15-16 minutes) between the log snippet "*Sending completed merkle tree to /10.11.12.123 for <tablename>*" seen on the remote DC and the log snippet "*Received merkle tree for <tablename> from /10.12.11.231*" seen on the node where the repair was triggered.
>>
>> Regards
>> Manish
>>
>> On Wed, Jan 19, 2022 at 4:29 PM manish khandelwal <manishkhandelwa...@gmail.com> wrote:
>>
>>> We use nodetool repair -pr -full. We have scheduled these to run automatically. For us it has also been seamless on most of the clusters; this particular node is misbehaving for reasons unknown to me. As per your suggestion, I am going through system.logs to find that unknown. Will keep you posted if I am able to find something.
>>>
>>> Regards
>>> Manish
>>>
>>> On Wed, Jan 19, 2022 at 4:10 PM Bowen Song <bo...@bso.ng> wrote:
>>>
>>>> May I ask how you run the repair? Is it manually via the nodetool command line tool, or via a tool or script such as Cassandra Reaper? If you are running the repairs manually, would you mind giving Cassandra Reaper a try?
>>>>
>>>> I have a fairly large cluster under my management, and the last time I tried "nodetool repair -full -pr" on a large table was maybe 3 years ago; it randomly got stuck (i.e. it sometimes worked fine, sometimes got stuck). To finish the repair, I had to either keep retrying or break down the token ranges into smaller subsets and use the "-st" and "-et" parameters. Since then I've switched to using Cassandra Reaper and have never had similar issues.
>>>>
>>>> On 19/01/2022 02:22, manish khandelwal wrote:
>>>>
>>>> Agree with you on that. Just wanted to highlight that I am experiencing the same behavior.
>>>>
>>>> Regards
>>>> Manish
>>>>
>>>> On Tue, Jan 18, 2022, 22:50 Bowen Song <bo...@bso.ng> wrote:
>>>>
>>>>> The link relates to Cassandra 1.2, and it is from 9 years ago. Cassandra was full of bugs at that time, and it has improved a lot since then. For that reason, I would rather not compare the issue you have with some 9-year-old issues someone else had.
>>>>>
>>>>> On 18/01/2022 16:11, manish khandelwal wrote:
>>>>>
>>>>> I am not sure what is happening, but it has happened thrice. Merkle trees are not being received from nodes of the other data center. I am getting an issue along similar lines to the one mentioned here: https://user.cassandra.apache.narkive.com/GTbqO6za/repair-hangs-when-merkle-tree-request-is-not-acknowledged
>>>>>
>>>>> Regards
>>>>> Manish
>>>>>
>>>>> On Tue, Jan 18, 2022, 18:18 Bowen Song <bo...@bso.ng> wrote:
>>>>>
>>>>>> Keep reading the logs on the initiator and on the node sending the merkle tree; does anything follow that? FYI, not all log entries have the repair ID in them, therefore please read the relevant logs in chronological order without filtering (e.g. "grep") on the repair ID.
>>>>>>
>>>>>> I'm sceptical that a network issue is causing all this. The merkle tree is sent over TCP connections, therefore the occasional few seconds of dropped packets from a network connectivity blip should not cause any issue for the repair. You should only start to see network-related issues if the network problem persists for a period of time close to or longer than the timeout values set in the cassandra.yaml file; in the case of repair, that's request_timeout_in_ms, which defaults to 10 seconds.
>>>>>>
>>>>>> Carry on examining the logs, you may find something useful.
>>>>>>
>>>>>> BTW, talking about stuck repairs, in my experience this can happen if two or more repairs are run concurrently on the same node (regardless of which node was the initiator) involving the same table. This could happen if you accidentally ran "nodetool repair" on two nodes and both involve the same table, or if you cancelled and then restarted a "nodetool repair" on a node without waiting for, or killing, the remnants of the first repair session on other nodes.
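As a rough sketch of the subrange approach Bowen mentions above with "-st" and "-et" (the keyspace, table and token values here are placeholders; the tokens must describe a subrange actually owned in the cluster, e.g. taken from nodetool ring output):

    # Repair one token subrange of a single table, then move on to the next -st/-et pair
    nodetool repair -full -st -9223372036854775808 -et -4611686018427387904 my_keyspace my_table

Breaking the job up this way also means that if one slice gets stuck, only that slice needs to be retried.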
>>>>>> On 18/01/2022 11:55, manish khandelwal wrote:
>>>>>>
>>>>>> In the system logs on the node where the repair was initiated, I see that the node has requested merkle trees from all nodes, including itself:
>>>>>>
>>>>>> INFO [Repair#3:1] 2022-01-14 03:32:18,805 RepairJob.java:172 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Requesting merkle trees for tablename (to [/xyz.abc.def.14, /xyz.abc.def.13, /xyz.abc.def.12, /xyz.mkn.pq.18, /xyz.mkn.pq.16, /xyz.mkn.pq.17])
>>>>>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,841 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.17
>>>>>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,847 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.16
>>>>>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,851 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.18
>>>>>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,856 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.abc.def.14
>>>>>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,876 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.abc.def.12
>>>>>>
>>>>>> As per the logs, the merkle tree is not received from the node with IP xyz.abc.def.13.
>>>>>>
>>>>>> In the system logs of the node with IP xyz.abc.def.13, I can see the following log:
>>>>>>
>>>>>> INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,850 Validator.java:281 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Sending completed merkle tree to /xyz.mkn.pq.17 for keyspace.tablename
>>>>>>
>>>>>> From the above I inferred that the repair task has become orphaned, since it is waiting for a merkle tree from a node and is never going to receive it, the tree having been lost somewhere in the network in between.
>>>>>>
>>>>>> Regards
>>>>>> Manish
>>>>>>
>>>>>> On Tue, Jan 18, 2022 at 4:39 PM Bowen Song <bo...@bso.ng> wrote:
>>>>>>
>>>>>>> The entry in the debug.log is not specific to a repair session, and it could also be caused by reasons other than a network connectivity issue, such as long STW GC pauses. I usually don't start troubleshooting an issue from the debug log, as it can be rather noisy. The system.log is a better starting point.
>>>>>>>
>>>>>>> If I were to troubleshoot the issue, I would start from the system logs on the node that initiated the repair, i.e. the node you ran the "nodetool repair" command on. Follow the repair ID (a UUID) in the logs on all nodes involved in the repair and read all related logs in chronological order to find out what exactly happened.
>>>>>>>
>>>>>>> BTW, if the issue is easily reproducible, I would re-run the repair with a reduced scope (such as a single table and token range) to get fewer logs related to the repair session. Fewer logs mean less time spent reading and analysing them.
>>>>>>>
>>>>>>> Hope this helps.
>>>>>>>
>>>>>>> On 18/01/2022 10:03, manish khandelwal wrote:
>>>>>>>
>>>>>>> I have a Cassandra 3.11.2 cluster with two DCs.
>>>>>>> While running repair, I am observing the following behavior.
>>>>>>>
>>>>>>> I am seeing that the node is not able to receive the merkle tree from one or two nodes. I am also able to see that the missing nodes did send the merkle tree, but it was not received. This makes the repair hang on a consistent basis. In netstats I can see output as follows:
>>>>>>>
>>>>>>> Mode: NORMAL
>>>>>>> Not sending any streams.
>>>>>>> Attempted: 7858888
>>>>>>> Mismatch (Blocking): 2560
>>>>>>> Mismatch (Background): 17173
>>>>>>> Pool Name                    Active   Pending      Completed   Dropped
>>>>>>> Large messages                  n/a         0           6313         3
>>>>>>> Small messages                  n/a         0       55978004         3
>>>>>>> Gossip messages                 n/a         0          93756       125
>>>>>>>
>>>>>>> Does it represent network issues? In the debug logs I saw something:
>>>>>>>
>>>>>>> DEBUG [MessagingService-Outgoing-hostname/xxx.yy.zz.kk-Large] 2022-01-14 05:00:19,031 OutboundTcpConnection.java:349 - Error writing to hostname/xxx.yy.zz.kk
>>>>>>> java.io.IOException: Connection timed out
>>>>>>>     at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_221]
>>>>>>>     at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_221]
>>>>>>>     at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_221]
>>>>>>>     at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_221]
>>>>>>>     at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_221]
>>>>>>>     at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[na:1.8.0_221]
>>>>>>>     at java.nio.channels.Channels.writeFully(Channels.java:98) ~[na:1.8.0_221]
>>>>>>>     at java.nio.channels.Channels.access$000(Channels.java:61) ~[na:1.8.0_221]
>>>>>>>     at java.nio.channels.Channels$1.write(Channels.java:174) ~[na:1.8.0_221]
>>>>>>>     at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) ~[lz4-1.3.0.jar:na]
>>>>>>>     at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) ~[lz4-1.3.0.jar:na]
>>>>>>>
>>>>>>> Does this show any network fluctuations?
>>>>>>>
>>>>>>> Regards
>>>>>>> Manish
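One way to sanity-check the idle-connection explanation reached at the top of this thread is to watch the keepalive timers on established internode connections. A sketch, assuming the default storage port 7000 (7001 with internode encryption) and a Linux host with iproute2 installed:

    # Show established internode TCP connections and their keepalive timers
    ss -tno state established '( sport = :7000 or dport = :7000 )'

With the kernel default tcp_keepalive_time of 7200 seconds, a connection that goes quiet after a validation can sit silent far longer than a 10-minute firewall idle timeout, which matches the behaviour described above.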