I've noticed the joining node is in a different rack than the rest of the nodes. Is this intended? Will you add all the new nodes to this rack and have RF=2 in that DC?

In principle, you should have an equal number of servers (vnodes) in each rack, and the number of racks should be equal to the RF, or be 1.
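For illustration, with GossipingPropertyFileSnitch each node declares its rack in cassandra-rackdc.properties, and the per-DC RF can be read back from cqlsh. A minimal sketch (the keyspace name and the BA1 DC name are placeholders, not taken from this cluster):

# cassandra-rackdc.properties on the joining node (GossipingPropertyFileSnitch)
#   dc=BA1
#   rack=SSW09    <- would normally match an existing, balanced rack

# read back the keyspace's replication settings
cqlsh -e "DESCRIBE KEYSPACE my_keyspace;"
# ... WITH replication = {'class': 'NetworkTopologyStrategy', 'BA1': '3', 'DR1': '3'}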


On 11/07/2022 13:15, Marc Hoppins wrote:

All clocks are fine.

Why would time sync affect whether or not a node appears in nodetool status when the command is run on a different node?  Either the node is up and visible or it is not.

From 24 other nodes (including ba-freddy14 itself), it shows in the status.

From those other 23 nodes AND from the joining node, ba-freddy03 (the one node whose nodetool status does not show the joining node) is itself visible when running nodetool.

A sample set of nodetool output follows. If you look at the last status, from ba-freddy03, you will see that the joining node (ba-freddy14) does not appear; yet when I started the join, and for the following 20-25 minutes, it DID appear in the status.  So I was just asking if anyone else had experienced this behaviour.
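(A quick way to compare every node's view is a loop like the one below; the host list is illustrative, and nodetool -h needs remote JMX to be reachable:)

for h in ba-freddy01 ba-freddy03 ba-freddy06 dr1-freddy04 dr1-freddy11; do
  echo "== $h =="
  # count of status lines mentioning the joining node: 1 = visible, 0 = missing
  nodetool -h "$h" status | grep -c freddy14
done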

(JOINING NODE) ba-freddy14: nodetool status -r

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN ba-freddy09   591.78 GiB  16      ? 9f7cdc62-2d5c-4d6e-be99-86c577131be5  SSW09

UJ ba-freddy14   117.37 GiB  16      ? bf85305e-256f-4eb9-9f15-5462f3b369b9  SSW05

UN ba-freddy06   614.26 GiB  16      ? 30d85b23-c66c-4781-86e9-960375caf476  SSW09

UN ba-freddy02   329.26 GiB  16      ? 3388ca94-5db5-4ef6-b7ab-e6fd0485ba49  SSW09

UN ba-freddy12   584.57 GiB  16      ? 80239a34-89cb-459b-a30f-4253bc16ed99  SSW09

UN ba-freddy07   563.51 GiB  16      ? 4de96ef6-bd48-4b16-bee1-05a0a6c9ac72  SSW09

UN ba-freddy01   578.5 GiB   16      ? 86a84980-2f8f-4d23-9099-d4b48ad9d04c  SSW09

UN ba-freddy05   575.33 GiB  16      ? 26c03d1b-9022-4e1c-bab4-d0d71bddf645  SSW09

UN ba-freddy10   581.16 GiB  16      ? 7c4051a5-1c77-4713-aa43-561063cedb3a  SSW09

UN ba-freddy08   605.92 GiB  16      ? 63fe46d1-c521-4df8-b1bb-ba0136168561  SSW09

UN ba-freddy04   585.65 GiB  16      ? 4503f80a-2890-4a3f-b0cb-d3cedc2b51d2  SSW09

UN ba-freddy11   576.46 GiB  16      ? b5b368fb-ebe3-4eed-a2a1-404b07ae2b6c  SSW09

UN ba-freddy03   568.95 GiB  16      ? 955f21a8-9bc8-4cef-b875-aa4cf7d3294c  SSW09

Datacenter: DR1

===============

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN dr1-freddy12  453.3 GiB   16      ? 533bb049-c8c9-41d9-8da6-64bdeeb6945d  SSW02

UN dr1-freddy08  448.99 GiB  16      ? 6e8c42d2-0f6d-4203-9bf7-5c5fe5e17093  SSW02

UN dr1-freddy07  450.07 GiB  16      ? 4c14b75a-74e8-4518-9c22-053b3a1ad991  SSW02

UN dr1-freddy02  453.69 GiB  16      ? e68298d7-e5eb-421f-a586-d5ee3c026627  SSW02

UN dr1-freddy10  453.17 GiB  16      ? 998bc6cb-7412-411a-89a6-ef5689d61a4a  SSW02

UN dr1-freddy05  463.07 GiB  16      ? 07876bd9-5374-4df8-a480-168b4c06f9f1  SSW02

UN dr1-freddy11  452.7 GiB   16      ? 38fca1c2-59da-4181-93a6-979b937b3fd9  SSW02

UN dr1-freddy03  460.23 GiB  16      ? a1ab1b4b-ccdc-4cb2-ad59-e9e67f0ddfbb  SSW02

UN dr1-freddy04  462.87 GiB  16      ? 29ee0eff-010d-4fbb-b204-095de2225031  SSW02

UN dr1-freddy06  454.26 GiB  16      ? 51467fd3-b795-4ba1-8eec-58b1030cb9c5  SSW02

UN dr1-freddy09  446.01 GiB  16      ? b071e232-b275-4ce7-809c-7c8fe546fbb4  SSW02

UN dr1-freddy01  450.6 GiB   16      ? c2340595-c3ec-440c-b978-62f62fd98a9a  SSW02

ba-freddy06: nodetool status -r

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN ba-freddy09   591.59 GiB  16      ? 9f7cdc62-2d5c-4d6e-be99-86c577131be5  SSW09

UJ ba-freddy14   117.37 GiB  16      ? bf85305e-256f-4eb9-9f15-5462f3b369b9  SSW05

UN ba-freddy06   614.12 GiB  16      ? 30d85b23-c66c-4781-86e9-960375caf476  SSW09

UN ba-freddy02   329.03 GiB  16      ? 3388ca94-5db5-4ef6-b7ab-e6fd0485ba49  SSW09

UN ba-freddy12   584.4 GiB   16      ? 80239a34-89cb-459b-a30f-4253bc16ed99  SSW09

UN ba-freddy07   563.36 GiB  16      ? 4de96ef6-bd48-4b16-bee1-05a0a6c9ac72  SSW09

UN ba-freddy01   578.36 GiB  16      ? 86a84980-2f8f-4d23-9099-d4b48ad9d04c  SSW09

UN ba-freddy05   575.19 GiB  16      ? 26c03d1b-9022-4e1c-bab4-d0d71bddf645  SSW09

UN ba-freddy10   580.93 GiB  16      ? 7c4051a5-1c77-4713-aa43-561063cedb3a  SSW09

UN ba-freddy08   605.79 GiB  16      ? 63fe46d1-c521-4df8-b1bb-ba0136168561  SSW09

UN ba-freddy04   585.5 GiB   16      ? 4503f80a-2890-4a3f-b0cb-d3cedc2b51d2  SSW09

UN ba-freddy11   576.31 GiB  16      ? b5b368fb-ebe3-4eed-a2a1-404b07ae2b6c  SSW09

UN ba-freddy03   568.81 GiB  16      ? 955f21a8-9bc8-4cef-b875-aa4cf7d3294c  SSW09

Datacenter: DR1

===============

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN dr1-freddy12  453.15 GiB  16      ? 533bb049-c8c9-41d9-8da6-64bdeeb6945d  SSW02

UN dr1-freddy08  448.82 GiB  16      ? 6e8c42d2-0f6d-4203-9bf7-5c5fe5e17093  SSW02

UN dr1-freddy07  449.9 GiB   16      ? 4c14b75a-74e8-4518-9c22-053b3a1ad991  SSW02

UN dr1-freddy02  453.45 GiB  16      ? e68298d7-e5eb-421f-a586-d5ee3c026627  SSW02

UN dr1-freddy10  453.02 GiB  16      ? 998bc6cb-7412-411a-89a6-ef5689d61a4a  SSW02

UN dr1-freddy05  462.92 GiB  16      ? 07876bd9-5374-4df8-a480-168b4c06f9f1  SSW02

UN dr1-freddy11  452.55 GiB  16      ? 38fca1c2-59da-4181-93a6-979b937b3fd9  SSW02

UN dr1-freddy03  460.08 GiB  16      ? a1ab1b4b-ccdc-4cb2-ad59-e9e67f0ddfbb  SSW02

UN dr1-freddy04  462.72 GiB  16      ? 29ee0eff-010d-4fbb-b204-095de2225031  SSW02

UN dr1-freddy06  454.11 GiB  16      ? 51467fd3-b795-4ba1-8eec-58b1030cb9c5  SSW02

UN dr1-freddy09  445.78 GiB  16      ? b071e232-b275-4ce7-809c-7c8fe546fbb4  SSW02

UN dr1-freddy01  450.46 GiB  16      ? c2340595-c3ec-440c-b978-62f62fd98a9a  SSW02

dr1-freddy04: nodetool status -r

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN ba-freddy09   592.05 GiB  16      ? 9f7cdc62-2d5c-4d6e-be99-86c577131be5  SSW09

UJ ba-freddy14   117.37 GiB  16      ? bf85305e-256f-4eb9-9f15-5462f3b369b9  SSW05

UN ba-freddy06   614.56 GiB  16      ? 30d85b23-c66c-4781-86e9-960375caf476  SSW09

UN ba-freddy02   329.57 GiB  16      ? 3388ca94-5db5-4ef6-b7ab-e6fd0485ba49  SSW09

UN ba-freddy12   584.88 GiB  16      ? 80239a34-89cb-459b-a30f-4253bc16ed99  SSW09

UN ba-freddy07   563.84 GiB  16      ? 4de96ef6-bd48-4b16-bee1-05a0a6c9ac72  SSW09

UN ba-freddy01   578.75 GiB  16      ? 86a84980-2f8f-4d23-9099-d4b48ad9d04c  SSW09

UN ba-freddy05   575.54 GiB  16      ? 26c03d1b-9022-4e1c-bab4-d0d71bddf645  SSW09

UN ba-freddy10   581.48 GiB  16      ? 7c4051a5-1c77-4713-aa43-561063cedb3a  SSW09

UN ba-freddy08   606.12 GiB  16      ? 63fe46d1-c521-4df8-b1bb-ba0136168561  SSW09

UN ba-freddy04   585.89 GiB  16      ? 4503f80a-2890-4a3f-b0cb-d3cedc2b51d2  SSW09

UN ba-freddy11   576.71 GiB  16      ? b5b368fb-ebe3-4eed-a2a1-404b07ae2b6c  SSW09

UN ba-freddy03   569.22 GiB  16      ? 955f21a8-9bc8-4cef-b875-aa4cf7d3294c  SSW09

Datacenter: DR1

===============

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN dr1-freddy12  453.6 GiB   16      ? 533bb049-c8c9-41d9-8da6-64bdeeb6945d  SSW02

UN dr1-freddy08  449.3 GiB   16      ? 6e8c42d2-0f6d-4203-9bf7-5c5fe5e17093  SSW02

UN dr1-freddy07  450.42 GiB  16      ? 4c14b75a-74e8-4518-9c22-053b3a1ad991  SSW02

UN dr1-freddy02  454.02 GiB  16      ? e68298d7-e5eb-421f-a586-d5ee3c026627  SSW02

UN dr1-freddy10  453.45 GiB  16      ? 998bc6cb-7412-411a-89a6-ef5689d61a4a  SSW02

UN dr1-freddy05  463.36 GiB  16      ? 07876bd9-5374-4df8-a480-168b4c06f9f1  SSW02

UN dr1-freddy11  453.01 GiB  16      ? 38fca1c2-59da-4181-93a6-979b937b3fd9  SSW02

UN dr1-freddy03  460.55 GiB  16      ? a1ab1b4b-ccdc-4cb2-ad59-e9e67f0ddfbb  SSW02

UN dr1-freddy04  463.19 GiB  16      ? 29ee0eff-010d-4fbb-b204-095de2225031  SSW02

UN dr1-freddy06  454.5 GiB   16      ? 51467fd3-b795-4ba1-8eec-58b1030cb9c5  SSW02

UN dr1-freddy09  446.3 GiB   16      ? b071e232-b275-4ce7-809c-7c8fe546fbb4  SSW02

UN dr1-freddy01  450.79 GiB  16      ? c2340595-c3ec-440c-b978-62f62fd98a9a  SSW02

dr1-freddy11: nodetool status -r

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN ba-freddy09   592.14 GiB  16      ? 9f7cdc62-2d5c-4d6e-be99-86c577131be5  SSW09

UJ ba-freddy14   117.37 GiB  16      ? bf85305e-256f-4eb9-9f15-5462f3b369b9  SSW05

UN ba-freddy06   614.56 GiB  16      ? 30d85b23-c66c-4781-86e9-960375caf476  SSW09

UN ba-freddy02   329.57 GiB  16      ? 3388ca94-5db5-4ef6-b7ab-e6fd0485ba49  SSW09

UN ba-freddy12   584.88 GiB  16      ? 80239a34-89cb-459b-a30f-4253bc16ed99  SSW09

UN ba-freddy07   563.84 GiB  16      ? 4de96ef6-bd48-4b16-bee1-05a0a6c9ac72  SSW09

UN ba-freddy01   578.75 GiB  16      ? 86a84980-2f8f-4d23-9099-d4b48ad9d04c  SSW09

UN ba-freddy05   575.61 GiB  16      ? 26c03d1b-9022-4e1c-bab4-d0d71bddf645  SSW09

UN ba-freddy10   581.48 GiB  16      ? 7c4051a5-1c77-4713-aa43-561063cedb3a  SSW09

UN ba-freddy08   606.19 GiB  16      ? 63fe46d1-c521-4df8-b1bb-ba0136168561  SSW09

UN ba-freddy04   585.98 GiB  16      ? 4503f80a-2890-4a3f-b0cb-d3cedc2b51d2  SSW09

UN ba-freddy11   576.77 GiB  16      ? b5b368fb-ebe3-4eed-a2a1-404b07ae2b6c  SSW09

UN ba-freddy03   569.22 GiB  16      ? 955f21a8-9bc8-4cef-b875-aa4cf7d3294c  SSW09

Datacenter: DR1

===============

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN dr1-freddy12  453.6 GiB   16      ? 533bb049-c8c9-41d9-8da6-64bdeeb6945d  SSW02

UN dr1-freddy08  449.3 GiB   16      ? 6e8c42d2-0f6d-4203-9bf7-5c5fe5e17093  SSW02

UN dr1-freddy07  450.42 GiB  16      ? 4c14b75a-74e8-4518-9c22-053b3a1ad991  SSW02

UN dr1-freddy02  454.02 GiB  16      ? e68298d7-e5eb-421f-a586-d5ee3c026627  SSW02

UN dr1-freddy10  453.45 GiB  16      ? 998bc6cb-7412-411a-89a6-ef5689d61a4a  SSW02

UN dr1-freddy05  463.36 GiB  16      ? 07876bd9-5374-4df8-a480-168b4c06f9f1  SSW02

UN dr1-freddy11  453.01 GiB  16      ? 38fca1c2-59da-4181-93a6-979b937b3fd9  SSW02

UN dr1-freddy03  460.55 GiB  16      ? a1ab1b4b-ccdc-4cb2-ad59-e9e67f0ddfbb  SSW02

UN dr1-freddy04  463.19 GiB  16      ? 29ee0eff-010d-4fbb-b204-095de2225031  SSW02

UN dr1-freddy06  454.5 GiB   16      ? 51467fd3-b795-4ba1-8eec-58b1030cb9c5  SSW02

UN dr1-freddy09  446.3 GiB   16      ? b071e232-b275-4ce7-809c-7c8fe546fbb4  SSW02

UN dr1-freddy01  450.86 GiB  16      ? c2340595-c3ec-440c-b978-62f62fd98a9a  SSW02

ba-freddy03: nodetool status -r

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN ba-freddy09   592.23 GiB  16      ? 9f7cdc62-2d5c-4d6e-be99-86c577131be5  SSW09

UN ba-freddy06   614.63 GiB  16      ? 30d85b23-c66c-4781-86e9-960375caf476  SSW09

UN ba-freddy02   329.66 GiB  16      ? 3388ca94-5db5-4ef6-b7ab-e6fd0485ba49  SSW09

UN ba-freddy12   584.97 GiB  16      ? 80239a34-89cb-459b-a30f-4253bc16ed99  SSW09

UN ba-freddy07   563.91 GiB  16      ? 4de96ef6-bd48-4b16-bee1-05a0a6c9ac72  SSW09

UN ba-freddy01   578.83 GiB  16      ? 86a84980-2f8f-4d23-9099-d4b48ad9d04c  SSW09

UN ba-freddy05   575.69 GiB  16      ? 26c03d1b-9022-4e1c-bab4-d0d71bddf645  SSW09

UN ba-freddy10   581.56 GiB  16      ? 7c4051a5-1c77-4713-aa43-561063cedb3a  SSW09

UN ba-freddy08   606.27 GiB  16      ? 63fe46d1-c521-4df8-b1bb-ba0136168561  SSW09

UN ba-freddy04   586.05 GiB  16      ? 4503f80a-2890-4a3f-b0cb-d3cedc2b51d2  SSW09

UN ba-freddy11   576.86 GiB  16      ? b5b368fb-ebe3-4eed-a2a1-404b07ae2b6c  SSW09

UN ba-freddy03   569.32 GiB  16      ? 955f21a8-9bc8-4cef-b875-aa4cf7d3294c  SSW09

Datacenter: DR1

===============

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address                            Load        Tokens  Owns Host ID                               Rack

UN dr1-freddy12  453.68 GiB  16      ? 533bb049-c8c9-41d9-8da6-64bdeeb6945d  SSW02

UN dr1-freddy08  449.39 GiB  16      ? 6e8c42d2-0f6d-4203-9bf7-5c5fe5e17093  SSW02

UN dr1-freddy07  450.51 GiB  16      ? 4c14b75a-74e8-4518-9c22-053b3a1ad991  SSW02

UN dr1-freddy02  454.11 GiB  16      ? e68298d7-e5eb-421f-a586-d5ee3c026627  SSW02

UN dr1-freddy10  453.54 GiB  16      ? 998bc6cb-7412-411a-89a6-ef5689d61a4a  SSW02

UN dr1-freddy05  463.44 GiB  16      ? 07876bd9-5374-4df8-a480-168b4c06f9f1  SSW02

UN dr1-freddy11  453.1 GiB   16      ? 38fca1c2-59da-4181-93a6-979b937b3fd9  SSW02

UN dr1-freddy03  460.62 GiB  16      ? a1ab1b4b-ccdc-4cb2-ad59-e9e67f0ddfbb  SSW02

UN dr1-freddy04  463.27 GiB  16      ? 29ee0eff-010d-4fbb-b204-095de2225031  SSW02

UN dr1-freddy06  454.57 GiB  16      ? 51467fd3-b795-4ba1-8eec-58b1030cb9c5  SSW02

UN dr1-freddy09  446.39 GiB  16      ? b071e232-b275-4ce7-809c-7c8fe546fbb4  SSW02

UN dr1-freddy01  450.94 GiB  16      ? c2340595-c3ec-440c-b978-62f62fd98a9a  SSW02

*From:* Joe Obernberger <joseph.obernber...@gmail.com>
*Sent:* Monday, July 11, 2022 1:29 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Adding nodes

I too came from HBase and discovered adding several nodes at a time doesn't work.  Are you absolutely sure that the clocks are in sync across the nodes?  This has bitten me several times.

-Joe

On 7/11/2022 6:23 AM, Bowen Song via user wrote:

    You should look for warning and error level logs in the
    system.log, not the debug.log or gc.log, and certainly not only
    the latest lines.

    BTW, you may want to spend some time investigating potential GC
    issues based on the GC logs you provided. I can see 1 full GC in
    the 3 hours since the node started. It's not necessarily a problem
    (if it only occasionally happens during the initial bootstrapping
    process), but it should justify an investigation if this is the
    first time you've seen it.
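    (A quick way to pull both signals out of the logs, assuming
    package-install paths; adjust to your log locations:)

    grep -E 'WARN|ERROR' /var/log/cassandra/system.log | tail -50
    # count full GCs and list recent stop-the-world pause times
    grep -c 'Full GC' /var/log/cassandra/gc.log.*.current
    grep 'Total time for which application threads were stopped' /var/log/cassandra/gc.log.*.current | tail -20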

    On 11/07/2022 11:09, Marc Hoppins wrote:

        Service still running. No errors showing.

        The latest info is in debug.log

        DEBUG [Streaming-EventLoop-4-3] 2022-07-11 12:00:38,902 NettyStreamingMessageSender.java:258 - [Stream #befbc5d0-00e7-11ed-860a-a139feb6a78a channel: 053f2911] Sending keep-alive

        DEBUG [Stream-Deserializer-/10.1.146.174:7000-053f2911] 2022-07-11 12:00:39,790 StreamingInboundHandler.java:179 - [Stream #befbc5d0-00e7-11ed-860a-a139feb6a78a channel: 053f2911] Received keep-alive

        DEBUG [ScheduledTasks:1] 2022-07-11 12:00:44,688 StorageService.java:2398 - Ignoring application state LOAD from /x.x.x.64:7000 because it is not a member in token metadata

        DEBUG [ScheduledTasks:1] 2022-07-11 12:01:44,689 StorageService.java:2398 - Ignoring application state LOAD from /x.x.x.64:7000 because it is not a member in token metadata

        DEBUG [ScheduledTasks:1] 2022-07-11 12:02:44,690 StorageService.java:2398 - Ignoring application state LOAD from /x.x.x.64:7000 because it is not a member in token metadata

        And

        gc.log.1.current

        2022-07-11T12:08:40.562+0200: 11122.837: [GC (Allocation Failure) 2022-07-11T12:08:40.562+0200: 11122.838: [ParNew
        Desired survivor size 41943040 bytes, new threshold 1 (max 1)
        - age   1: 57264 bytes,      57264 total
        : 655440K->74K(737280K), 0.0289143 secs] 2575800K->1920436K(8128512K), 0.0291355 secs] [Times: user=0.23 sys=0.00, real=0.03 secs]
        Heap after GC invocations=6532 (full 1):
         par new generation   total 737280K, used 74K [0x00000005cae00000, 0x00000005fce00000, 0x00000005fce00000)
          eden space 655360K,   0% used [0x00000005cae00000, 0x00000005cae00000, 0x00000005f2e00000)
          from space 81920K,   0% used [0x00000005f2e00000, 0x00000005f2e12848, 0x00000005f7e00000)
          to   space 81920K,   0% used [0x00000005f7e00000, 0x00000005f7e00000, 0x00000005fce00000)
         concurrent mark-sweep generation total 7391232K, used 1920362K [0x00000005fce00000, 0x00000007c0000000, 0x00000007c0000000)
         Metaspace       used 53255K, capacity 56387K, committed 56416K, reserved 1097728K
          class space    used 6926K, capacity 7550K, committed 7576K, reserved 1048576K
        }
        2022-07-11T12:08:40.591+0200: 11122.867: Total time for which application threads were stopped: 0.0309913 seconds, Stopping threads took: 0.0012599 seconds
        {Heap before GC invocations=6532 (full 1):
         par new generation   total 737280K, used 655434K [0x00000005cae00000, 0x00000005fce00000, 0x00000005fce00000)
          eden space 655360K, 100% used [0x00000005cae00000, 0x00000005f2e00000, 0x00000005f2e00000)
          from space 81920K,   0% used [0x00000005f2e00000, 0x00000005f2e12848, 0x00000005f7e00000)
          to   space 81920K,   0% used [0x00000005f7e00000, 0x00000005f7e00000, 0x00000005fce00000)
         concurrent mark-sweep generation total 7391232K, used 1920362K [0x00000005fce00000, 0x00000007c0000000, 0x00000007c0000000)
         Metaspace       used 53255K, capacity 56387K, committed 56416K, reserved 1097728K
          class space    used 6926K, capacity 7550K, committed 7576K, reserved 1048576K
        2022-07-11T12:08:42.163+0200: 11124.438: [GC (Allocation Failure) 2022-07-11T12:08:42.163+0200: 11124.438: [ParNew
        Desired survivor size 41943040 bytes, new threshold 1 (max 1)
        - age   1: 54984 bytes,      54984 total
        : 655434K->80K(737280K), 0.0291754 secs] 2575796K->1920445K(8128512K), 0.0293884 secs] [Times: user=0.22 sys=0.00, real=0.03 secs]

        *From:* Bowen Song via user <user@cassandra.apache.org>
        *Sent:* Monday, July 11, 2022 11:56 AM
        *To:* user@cassandra.apache.org
        *Subject:* Re: Adding nodes

        Checking on multiple nodes won't help if the joining node
        suffers from any of the issues I described, as it will likely
        be flipping up and down frequently, and the existing nodes in
        the cluster may not reach agreement until the joining node
        stays up (or stays down) for a while. However, it would be
        very strange if this were a persistent behaviour. If the
        'nodetool status' output on each node remained unchanged for
        hours and the outputs aren't the same between nodes, it could
        be an indicator of something else that had gone wrong.

        Does the strange behaviour go away after the joining node
        completes the streaming and fully joins the cluster?
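        (If the disagreement persists, comparing gossip state directly
        can help. A sketch; gossipinfo lists endpoints by IP, so
        substitute the joining node's address for the placeholder:)

        # run on ba-freddy03 and on a node that does see the joining node
        nodetool gossipinfo | grep -A 6 '/x.x.x.64'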

        On 11/07/2022 10:46, Marc Hoppins wrote:

            I am beginning to wonder…

            If you recall, I stated that I had checked status on a
            bunch of other nodes from both datacentres and the joining
            node shows up. No errors are occurring anywhere; data is
            streaming; node is joining…but, as I also stated, on the
            initial node which I only used to run the nodetool status,
            the new node is no longer showing up.  Thus the new node
            has not disappeared from the cluster, only from nodetool
            status on that particular node – which is already in the
            cluster, has been so for several weeks, and is also
            functioning without error.

            *From:* Bowen Song via user <user@cassandra.apache.org>
            *Sent:* Monday, July 11, 2022 11:40 AM
            *To:* user@cassandra.apache.org
            *Subject:* Re: Adding nodes

            A node in the joining state can disappear from the cluster
            from other nodes' perspective if the joining node stops
            sending/receiving gossip messages to other nodes. This can
            happen when the joining node is severely overloaded, has
            bad network connectivity or is stuck in long STW GC pauses.
            Regardless of the reason behind it, the state shown on the
            joining node will remain as joining unless the streaming
            process has failed.

            The node state is propagated between nodes via gossip, and
            there may be a delay before all existing nodes agree on
            the fact that the joining node is no longer in the
            cluster. Within that delay, different nodes in the cluster
            may show different results in 'nodetool status'.

            You should check the logs on the existing nodes and the
            joining node to find out why it is happening, and make
            appropriate changes if needed.

            On 11/07/2022 09:23, Marc Hoppins wrote:

                Further oddities…

                I was sitting here watching our new new node being
                added (nodetool status being run from one of the seed
                nodes) and all was going well.  Then I noticed that
                our new new node was no longer visible.  I checked the
                service on the new new node and it was still running.
                So I checked status from this node and it shows in the
                status report (still UJ and streaming data), but takes
                a little longer to get the results than it did when it
                was visible from the seed.

                I checked status from a few different nodes in both
                datacentres (including other seeds) and the new new
                node shows up but from the original seed node, it does
                not appear in the nodetool status. Can anyone shed any
                light on this phenomenon?

                *From:* Marc Hoppins <marc.hopp...@eset.com>
                *Sent:* Monday, July 11, 2022 10:02 AM
                *To:* user@cassandra.apache.org
                *Cc:* Bowen Song <bo...@bso.ng>
                *Subject:* RE: Adding nodes

                Well then…

                I left this on Friday (still running) and came back to
                it today (Monday) to find the service stopped. So, I
                blitzed this node from the ring and began anew with a
                different new node.

                I rather suspect the problem was with trying to use
                Ansible to add these initially - despite the fact that
                I had a serial limit of 1 and a pause of 90s for
                starting the service on each new node (based on the
                time taken when setting up this Cassandra cluster).

                So…moving forward…

                It is recommended to only add one new node at a time
                from what I read.  This leads me to:

                Although I see the new node LOAD is progressing far
                faster than the previous failure, it is still going to
                take several hours to move from UJ to UN, which means
                I’ll be at this all week for the 12 new nodes. If our
                LOAD per node is around 400-600GB, is there any
                practical method to speed up adding multiple new nodes
                which is unlikely to cause problems?  After all, in
                the modern world of big (how big is big?) data, 600G
                per node is far less than the real BIG big-data.
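                (One knob worth checking while a node bootstraps is the
                streaming throttle; a sketch, values illustrative:)

                nodetool getstreamthroughput      # default is 200 megabits/s
                nodetool setstreamthroughput 400  # raise on the nodes streaming data
                nodetool setstreamthroughput 0    # 0 disables the throttle entirely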

                Marc

                *From:* Jeff Jirsa <jji...@gmail.com>
                *Sent:* Friday, July 8, 2022 5:46 PM
                *To:* cassandra <user@cassandra.apache.org>
                *Cc:* Bowen Song <bo...@bso.ng>
                *Subject:* Re: Adding nodes

                Having a node UJ but not sending/receiving other
                streams is an invalid state (unless 4.0 moved the
                streaming data out of netstats? I'm not 100% sure, but
                I'm 99% sure it should be there).

                It likely stopped the bootstrap process long ago with
                an error (which you may not have seen), and is running
                without being in the ring, but also not trying to join
                the ring.

                145GB vs 1.1T could be bits vs bytes (that's a factor
                of 8), or it could be that you streamed data and
                compacted it away. Hard to say, but less important -
                the fact that it's UJ but not streaming means there's
                a different problem.

                If it's me, I do this (not guaranteed to work, your
                mileage may vary, etc):

                1) Look for errors in the logs of ALL hosts. In the
                joining host, look for an exception that stops
                bootstrap. In the others, look for messages about
                errors streaming, and/or exceptions around file
                access. In all of those hosts, check to see if any of
                them think they're streaming ( nodetool netstats again)

                2) Stop the joining host. It's almost certainly not
                going to finish now. Remove data directories,
                commitlog directory, saved caches, hints. Wait 2
                minutes. Make sure every other host in the cluster
                sees it disappear from the ring. Then, start it fresh
                and let it bootstrap again. (you could alternatively
                try the resumable bootstrap option, but I never use it).
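                (A sketch of step 2, assuming a package install with
                default directories; substitute the data_file_directories,
                commitlog_directory, saved_caches_directory and
                hints_directory values from cassandra.yaml:)

                sudo systemctl stop cassandra
                sudo rm -rf /var/lib/cassandra/data/* \
                            /var/lib/cassandra/commitlog/* \
                            /var/lib/cassandra/saved_caches/* \
                            /var/lib/cassandra/hints/*
                # from another node: confirm it has left the ring
                nodetool status
                # then start it to bootstrap from scratch
                sudo systemctl start cassandra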

                On Fri, Jul 8, 2022 at 2:56 AM Marc Hoppins
                <marc.hopp...@eset.com> wrote:

                    Ifconfig shows RX of 1.1T. This doesn't seem to
                    fit with the LOAD of 145GiB (nodetool status),
                    unless I am reading that wrong...and the fact that
                    this node still has a status of UJ.

                    Netstats on this node shows (abridged):
                    Read Repair Statistics:
                    Attempted: 0
                    Mismatch (Blocking): 0
                    Mismatch (Background): 0
                    Pool Name        Active  Pending  Completed  Dropped
                    Large messages   n/a     0        0          0
                    Small messages   n/a     53       569755545  15740262
                    Gossip messages  n/a     0        288878     2
                    None of this addresses the issue of not being able
                    to add more nodes.

                    -----Original Message-----
                    From: Bowen Song via user <user@cassandra.apache.org>
                    Sent: Friday, July 8, 2022 11:47 AM
                    To: user@cassandra.apache.org
                    Subject: Re: Adding nodes

                    I would assume that's 85 GB (i.e. gigabytes) then.
                    Which is approximately 79 GiB (i.e. gibibytes).
                    This still sounds awfully slow - less than 1MB/s
                    over a full day (24 hours).

                    You said CPU and network aren't the bottleneck.
                    Have you checked the disk IO? Also, be mindful
                    with CPU usage. It can still be a bottleneck if
                    one thread uses 100% of a CPU core while all other
                    cores are idle.
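                    (Standard OS tools are enough for a first look;
                    for example:)

                    iostat -x 5   # data-disk %util near 100 = IO bound
                    top -H        # one thread pinned at 100% = single-core bottleneck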

                    On 08/07/2022 07:09, Marc Hoppins wrote:
                    > Thank you for pointing that out.
                    >
                    > 85
                    gigabytes/gibibytes/GIGABYTES/GIBIBYTES/whatever
                    name you care to
                    > give it
                    >
                    > CPU and bandwidth are not the problem.
                    >
                    > Version 4.0.3 but, as I stated, all nodes use
                    the same version so the version is not important
                    either.
                    >
                    > Existing nodes have 350-400+(choose whatever you
                    want to call a
                    > gigabyte)
                    >
                    > The problem appears to be that adding new nodes
                    is a serial process, which is fine when there is
                    > no data and each node is added within 2 minutes.
                    It is hardly practical in production.
                    >
                    > -----Original Message-----
                    > From: Bowen Song via user
                    <user@cassandra.apache.org>
                    > Sent: Thursday, July 7, 2022 8:43 PM
                    > To: user@cassandra.apache.org
                    > Subject: Re: Adding nodes
                    >
                    > 86Gb (that's gigabits, which is 10.75GB,
                    gigabytes) took an entire day seems obviously too
                    long. I would check the network bandwidth, disk IO
                    and CPU usage and find out what is the bottleneck.
                    >
                    > On 07/07/2022 15:48, Marc Hoppins wrote:
                    >> Hi all,
                    >>
                    >> Cluster of 2 DC and 24 nodes
                    >>
                    >> DC1 (RF3) = 12 nodes, 16 tokens each
                    >> DC2 (RF3) = 12 nodes, 16 tokens each
                    >>
                    >> Adding 12 more nodes to DC1: I installed
                    Cassandra (version is the same across all nodes)
                    but, after the first node added, I couldn't seem
                    to add any further nodes.
                    >>
                    >> I check nodetool status and the newly added
                    node is UJ. It remains this way all day and only
                    86Gb of data is added to the node over the entire
                    day (probably not yet complete).  This seems a
                    little slow and, more than a little inconvenient
                    to only be able to add one node at a time - or at
                    least one node every 2 minutes. When the cluster
                    was created, I timed each node from service start
                    to status UJ (having a UUID) and it was around 120
                    seconds.  Of course there was no data.
                    >>
                    >> Is it possible I have some setting not
                    correctly tuned?
                    >>
                    >> Thanks
                    >>
                    >> Marc
