TBH, 10 minutes is pretty low. That's more suitable for a web server
than a database server. If it's easy to do, you may prefer to increase
that on the firewall instead of tuning Cassandra. Cassandra won't be the
only thing affected by it, and you may just save yourself some debugging
time in the future.
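For what it's worth, if that firewall happens to be a Linux box using
netfilter/conntrack (an assumption on my part; adjust for whatever
appliance you actually run), the knob is usually the
established-connection timeout:

    # show the current idle timeout for established TCP connections (seconds)
    sysctl net.netfilter.nf_conntrack_tcp_timeout_established

    # raise it well above Cassandra's idle periods, e.g. to 24 hours;
    # persist the setting under /etc/sysctl.d/ to survive reboots
    sysctl -w net.netfilter.nf_conntrack_tcp_timeout_established=86400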
On 25/01/2022 03:22, manish khandelwal wrote:
The TCP aging value is 10 mins. So with 7200 seconds for
tcp_keepalive_time, the node was going unresponsive. Is the TCP aging
value too low, or is it reasonable?
On Mon, Jan 24, 2022 at 11:32 PM Bowen Song <bo...@bso.ng> wrote:
Is reconfiguring your firewall an option? A stateful firewall
really shouldn't remove a TCP connection in such a short time,
unless the number of connections is very large and generally
short-lived (which is often the case for web servers).
On 24/01/2022 13:03, manish khandelwal wrote:
Hi All
Thanks for the suggestions. The issue was that
*tcp_keepalive_time* had the default value (7200 seconds), so
once the idle connection was broken by the firewall, the
application (the Cassandra node) was notified very late. That is
why we were seeing one node sending the Merkle tree and the other
not receiving it. Reducing it to 60 seconds solved the problem.
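For reference, this is roughly what the change looks like on each
node; only tcp_keepalive_time=60 is the change described above, the
interval and probe count shown are illustrative values you may also
want to review:

    # start sending keepalive probes after 60s of idle time (the fix above)
    sysctl -w net.ipv4.tcp_keepalive_time=60

    # illustrative companions: probe every 10s, drop after 5 failed probes
    sysctl -w net.ipv4.tcp_keepalive_intvl=10
    sysctl -w net.ipv4.tcp_keepalive_probes=5

    # persist in /etc/sysctl.conf or /etc/sysctl.d/ so it survives reboots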
Thanks again for the help.
Regards
Manish
On Sat, Jan 22, 2022 at 12:25 PM C. Scott Andreas
<sc...@paradoxica.net> wrote:
Hi Manish,
I understand this answer is non-specific and might not be the
most helpful, but figured I’d mention — Cassandra 3.11.2 is
nearly four years old and a large number of bugs in repair
and other subsystems have been resolved in the time since.
I’d recommend upgrading to the latest release in the 3.11
series at minimum (3.11.11). You may find that the issue is
resolved; or if not, be able to draw upon the community’s
knowledge of a current release of the database.
— Scott
On Jan 21, 2022, at 8:51 PM, manish khandelwal
<manishkhandelwa...@gmail.com> wrote:
Hi All
After going through the system.logs, I still see that sometimes
the Merkle tree is not received from the remote DC nodes. Local
DC nodes respond as soon as they send. But in the case of the
remote DC, it happens that one or two nodes do not respond.
There is a considerable time lag (15-16 minutes) between the log
snippet "*Sending completed merkle tree to /10.11.12.123 for
<tablename>*" seen on the remote DC and the log snippet
"*Received merkle tree for <tablename> from /10.12.11.231*" seen
on the node where the repair was triggered.
Regards
Manish
On Wed, Jan 19, 2022 at 4:29 PM manish khandelwal
<manishkhandelwa...@gmail.com> wrote:
We use nodetool repair -pr -full. We have scheduled
these repairs to run automatically, and for us too it has been
seamless on most of the clusters. This particular node
is misbehaving for reasons unknown to me. As per your
suggestion, I am going through the system.logs to find that
unknown cause. Will keep you posted if I am able to find something.
Regards
Manish
On Wed, Jan 19, 2022 at 4:10 PM Bowen Song
<bo...@bso.ng> wrote:
May I ask how you run the repair? Is it manually
via the nodetool command line tool, or with a tool or
script such as Cassandra Reaper? If you are running
the repairs manually, would you mind giving Cassandra
Reaper a try?
I have a fairly large cluster under my management,
and the last time I tried "nodetool repair -full -pr" on
a large table was maybe 3 years ago; it randomly got
stuck (i.e. it sometimes worked fine, sometimes got
stuck). To finish the repair, I had to either keep
retrying or break the token ranges down into smaller
subsets and use the "-st" and "-et" parameters.
Since then I've switched to Cassandra Reaper and
have never had similar issues.
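For reference, a subrange repair of that kind looks roughly like
the following; the keyspace, table and token values here are
placeholders, not from your cluster:

    # full repair of only the token range (-st, -et] for one table
    nodetool repair -full \
        -st -9223372036854775808 -et -4611686018427387904 \
        my_keyspace my_table

Running a handful of such commands over consecutive ranges is
essentially what Reaper automates for you.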
On 19/01/2022 02:22, manish khandelwal wrote:
Agree with you on that. Just wanted to highlight
that I am experiencing the same behavior.
Regards
Manish
On Tue, Jan 18, 2022, 22:50 Bowen Song
<bo...@bso.ng> wrote:
The link was related to Cassandra 1.2, and it
was 9 years ago. Cassandra was full of bugs at
that time, and it has improved a lot since
then. For that reason, I would rather not
compare the issue you have with some 9-year-old
issues someone else had.
On 18/01/2022 16:11, manish khandelwal wrote:
I am not sure what is happening, but it has
happened thrice. Merkle trees are not being
received from nodes of the other data center.
I am getting an issue along similar lines as
mentioned here:
https://user.cassandra.apache.narkive.com/GTbqO6za/repair-hangs-when-merkle-tree-request-is-not-acknowledged
Regards
Manish
On Tue, Jan 18, 2022, 18:18 Bowen Song
<bo...@bso.ng> wrote:
Keep reading the logs on the initiator and
the node sending the Merkle tree; does anything
follow that? FYI, not every log line has the
repair ID in it, therefore please read the
relevant logs in chronological order
without filtering (e.g. "grep") on the
repair ID.
I'm sceptical that a network issue is causing
all this. The Merkle tree is sent over TCP
connections, therefore the occasional few
seconds of dropped packets or lost
connectivity should not cause any issue for
the repair. You should only start to see
network-related issues if the network problem
persists for a period of time close to or
longer than the timeout values set in the
cassandra.yaml file; in the case of repair
it's request_timeout_in_ms, which defaults to
10 seconds.
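For example, to check what a node is actually using (the file
path below assumes a package install and may differ on your
systems; the default value is 10000 ms):

    # inspect the configured request timeout on a node
    grep '^request_timeout_in_ms' /etc/cassandra/cassandra.yaml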
Carry on examining the logs; you may find
something useful.
BTW, talking about stuck repairs, in my
experience this can happen if two or more
repairs were run concurrently on the same
node (regardless of which node was the
initiator) involving the same table. This
could happen if you accidentally ran
"nodetool repair" on two nodes and both
involved the same table, or if you
cancelled and then restarted a "nodetool
repair" on a node without waiting for or
killing the remnants of the first repair
session on other nodes.
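A quick, rough way to check whether validation work from an
earlier repair is still running or queued on a node is to look at
its repair-related thread pools:

    # non-zero Active/Pending here suggests a previous repair is still busy
    nodetool tpstats | grep -E 'ValidationExecutor|AntiEntropyStage'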
On 18/01/2022 11:55, manish khandelwal wrote:
In the system logs on the node where the
repair was initiated, I see that the node
has requested the Merkle tree from all nodes,
including itself:
INFO [Repair#3:1] 2022-01-14 03:32:18,805 RepairJob.java:172 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Requesting merkle trees for tablename (to [/xyz.abc.def.14, /xyz.abc.def.13, /xyz.abc.def.12, /xyz.mkn.pq.18, /xyz.mkn.pq.16, /xyz.mkn.pq.17])
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,841 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.17
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,847 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.16
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,851 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.mkn.pq.18
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,856 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.abc.def.14
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,876 RepairSession.java:180 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Received merkle tree for tablename from /xyz.abc.def.12
As per the logs, the Merkle tree is not
received from the node with IP *xyz.abc.def.13*.
In the system logs of the node with IP
*xyz.abc.def.13*, I can see the following:
INFO [AntiEntropyStage:1] 2022-01-14 03:32:18,850 Validator.java:281 - [repair #6e3385e0-74d1-11ec-8e66-9f084ace9968] Sending completed merkle tree to /xyz.mkn.pq.17 for keyspace.tablename
From the above I inferred that the repair
task has become orphaned: it is waiting for
a Merkle tree from a node that it is never
going to receive, because the tree was lost
somewhere in the network in between.
Regards
Manish
On Tue, Jan 18, 2022 at 4:39 PM Bowen
Song <bo...@bso.ng> wrote:
The entry in the debug.log is not
specific to a repair session, and it
could also be caused by reasons other
than a network connectivity issue, such
as long STW GC pauses. I usually
don't start troubleshooting an issue
from the debug log, as it can be
rather noisy. The system.log is a
better starting point.
If I were to troubleshoot the issue, I
would start from the system logs on
the node that initiated the repair,
i.e. the node you ran the "nodetool
repair" command on. Follow the repair
ID (a UUID) in the logs on all nodes
involved in the repair and read all
related logs in chronological order
to find out what exactly happened.
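As a rough example of doing that on each involved node (the log
path assumes a package install; substitute the actual repair
UUID):

    # 1. find the repair session and its UUID on the initiator
    grep 'Requesting merkle trees' /var/log/cassandra/system.log

    # 2. on every node, list the lines carrying that UUID to get timestamps
    grep '<repair-uuid>' /var/log/cassandra/system.log

    # 3. then read the full log around those timestamps, since not
    #    every relevant line contains the repair ID
    less /var/log/cassandra/system.log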
BTW, if the issue is easily
reproducible, I would re-run the
repair with a reduced scope (such as a
single table or token range) to get
fewer logs related to the repair session.
Fewer logs means less time spent
reading and analysing them.
Hope this helps.
On 18/01/2022 10:03, manish
khandelwal wrote:
I have a Cassandra 3.11.2 cluster
with two DCs. While running repair,
I am observing the following behaviour:
the node is not able to receive the
Merkle tree from one or two nodes.
I can also see that the missing nodes
did send the Merkle tree, but it was
not received. This makes the repair
hang on a consistent basis. In
netstats I can see output as follows:
Mode: NORMAL
Not sending any streams.
Attempted: 7858888
Mismatch (Blocking): 2560
Mismatch (Background): 17173
Pool Name                    Active   Pending      Completed   Dropped
Large messages                  n/a         0           6313         3
Small messages                  n/a         0       55978004         3
Gossip messages                 n/a         0          93756       125

Does it represent network issues? In the debug logs I saw something:

DEBUG [MessagingService-Outgoing-hostname/xxx.yy.zz.kk-Large] 2022-01-14 05:00:19,031 OutboundTcpConnection.java:349 - Error writing to hostname/xxx.yy.zz.kk
java.io.IOException: Connection timed out
        at sun.nio.ch.FileDispatcherImpl.write0(Native Method) ~[na:1.8.0_221]
        at sun.nio.ch.SocketDispatcher.write(SocketDispatcher.java:47) ~[na:1.8.0_221]
        at sun.nio.ch.IOUtil.writeFromNativeBuffer(IOUtil.java:93) ~[na:1.8.0_221]
        at sun.nio.ch.IOUtil.write(IOUtil.java:65) ~[na:1.8.0_221]
        at sun.nio.ch.SocketChannelImpl.write(SocketChannelImpl.java:471) ~[na:1.8.0_221]
        at java.nio.channels.Channels.writeFullyImpl(Channels.java:78) ~[na:1.8.0_221]
        at java.nio.channels.Channels.writeFully(Channels.java:98) ~[na:1.8.0_221]
        at java.nio.channels.Channels.access$000(Channels.java:61) ~[na:1.8.0_221]
        at java.nio.channels.Channels$1.write(Channels.java:174) ~[na:1.8.0_221]
        at net.jpountz.lz4.LZ4BlockOutputStream.flushBufferedData(LZ4BlockOutputStream.java:205) ~[lz4-1.3.0.jar:na]
        at net.jpountz.lz4.LZ4BlockOutputStream.write(LZ4BlockOutputStream.java:158) ~[lz4-1.3.0.jar:na]
Does this show any network
fluctuations?
Regards
Manish