Lots of things come to mind. We need more information from you to help us understand:
How long have you had your cluster running?
Is it generally working OK?
Is it just one node that is misbehaving at a time?
How many nodes do you need to replace?
Are you doing rolling restarts rather than restarting nodes simultaneously?
Do you have enough capacity on your machines?
Did you say some of the nodes are at 90% capacity?
When did this problem begin?
Could something be causing a race condition?
Did you recheck the commands you used to make sure they are correct?
What procedure do you use?

From: Léo FERLIN SUTTON [mailto:lfer...@mailjet.com.INVALID]
Sent: Thursday, February 07, 2019 9:16 AM
To: user@cassandra.apache.org
Subject: Re: [EXTERNAL] Re: Bootstrap keeps failing

Thank you for the recommendation.

We are already using DataStax's recommended settings for tcp_keepalive.

Regards,
Leo

On Thu, Feb 7, 2019 at 5:49 PM Durity, Sean R <sean_r_dur...@homedepot.com> wrote:

I have seen unreliable streaming (streaming that doesn’t finish) because of TCP timeouts from firewalls or switches. The default tcp_keepalive kernel parameters are usually not tuned for that. See https://docs.datastax.com/en/dse-trblshoot/doc/troubleshooting/idleFirewallLinux.html for more details. These “remote” timeouts are difficult to detect or prove if you don’t have access to the intermediate network equipment.

Sean Durity

From: Léo FERLIN SUTTON <lfer...@mailjet.com.INVALID>
Sent: Thursday, February 07, 2019 10:26 AM
To: user@cassandra.apache.org; dinesh.jo...@yahoo.com
Subject: [EXTERNAL] Re: Bootstrap keeps failing

Hello!

Thank you for your answers.

So I have tried, multiple times, to start bootstrapping from scratch. I often have the same problem (on other nodes as well), but sometimes it works and I can move on to another node.

I have attached a jstack dump and some logs. Our node was shut down at around 97% disk space used. I turned it back on and it started the bootstrap process again.

The log file is from this attempt, as is the thread dump.
Small warning: I have somewhat anonymised the log files, so there may be some inconsistencies.

Regards,
Leo

On Thu, Feb 7, 2019 at 8:13 AM dinesh.jo...@yahoo.com.INVALID <dinesh.jo...@yahoo.com.invalid> wrote:

Would it be possible for you to take a thread dump & logs and share them?

Dinesh

On Wednesday, February 6, 2019, 10:09:11 AM PST, Léo FERLIN SUTTON <lfer...@mailjet.com.INVALID> wrote:

Hello!

I am having a recurrent problem when trying to bootstrap a few new nodes.

Some general info:

* I am running Cassandra 3.0.17
* We have about 30 nodes in our cluster
* All healthy nodes have between 60% and 90% used disk space on /var/lib/cassandra

So I create a new node and let auto_bootstrap do its job. After a few days the bootstrapping node stops streaming new data but is still not a member of the cluster.

`nodetool status` says the node is still joining.

When this happens I run `nodetool bootstrap resume`. This usually ends in one of two ways:

1. The node fills up to 100% disk space and crashes.
2. The bootstrap resume finishes with errors.

When I look at `nodetool netstats -H`, it looks like `bootstrap resume` does not resume but restarts a full transfer of all data from every node.
This is the output I get from `nodetool bootstrap resume`:

[2019-02-06 01:39:14,369] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-225-big-Data.db (progress: 2113%)
[2019-02-06 01:39:16,821] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-88-big-Data.db (progress: 2113%)
[2019-02-06 01:39:17,003] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-89-big-Data.db (progress: 2113%)
[2019-02-06 01:39:17,032] session with /10.16.XX.YYY complete (progress: 2113%)
[2019-02-06 01:41:15,160] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-220-big-Data.db (progress: 2113%)
[2019-02-06 01:42:02,864] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-226-big-Data.db (progress: 2113%)
[2019-02-06 01:42:09,284] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-227-big-Data.db (progress: 2113%)
[2019-02-06 01:42:10,522] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-228-big-Data.db (progress: 2113%)
[2019-02-06 01:42:10,622] received file /var/lib/cassandra/raw/raw_17930-d7cc0590230d11e9bc0af381b0ee7ac6/mc-229-big-Data.db (progress: 2113%)
[2019-02-06 01:42:11,925] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-90-big-Data.db (progress: 2114%)
[2019-02-06 01:42:14,887] received file /var/lib/cassandra/data/system_distributed/repair_history-759fffad624b318180eefa9a52d1f627/mc-91-big-Data.db (progress: 2114%)
[2019-02-06 01:42:14,980] session with /10.16.XX.ZZZ complete (progress: 2114%)
[2019-02-06 01:42:14,980] Stream failed
[2019-02-06 01:42:14,982] Error during bootstrap: Stream failed
[2019-02-06 01:42:14,982] Resume bootstrap complete

The bootstrap `progress` goes way over 100% and eventually fails.
Right now I have a node with this output from `nodetool status`:

`UJ 10.16.XX.YYY 2.93 TB 256 ? 5788f061-a3c0-46af-b712-ebeecd397bf7 c`

It is almost filled with data, yet if I look at `nodetool netstats`:

Receiving 480 files, 325.39 GB total. Already received 5 files, 68.32 MB total
Receiving 499 files, 328.96 GB total. Already received 1 files, 1.32 GB total
Receiving 506 files, 345.33 GB total. Already received 6 files, 24.19 MB total
Receiving 362 files, 206.73 GB total. Already received 7 files, 34 MB total
Receiving 424 files, 281.25 GB total. Already received 1 files, 1.3 GB total
Receiving 581 files, 349.26 GB total. Already received 8 files, 45.96 MB total
Receiving 443 files, 337.26 GB total. Already received 6 files, 96.15 MB total
Receiving 424 files, 275.23 GB total. Already received 5 files, 42.67 MB total

It is trying to pull all the data again.

Am I missing something about the way `nodetool bootstrap resume` is supposed to be used?

Regards,
Leo
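
A note for readers of this thread: one quick way to check the observation that the streams are starting over is to total the "Receiving ... Already received ..." summary lines from `nodetool netstats`. The following is only a minimal Python sketch. It assumes the line format matches the excerpt quoted in the original message and that the size units are 1024-based; `NETSTATS`, `LINE_RE`, `TO_GB` and `summarize` are names made up for this illustration and are not part of any Cassandra tooling.

```python
import re

# Summary lines copied from the `nodetool netstats` excerpt in the original message.
NETSTATS = """\
Receiving 480 files, 325.39 GB total. Already received 5 files, 68.32 MB total
Receiving 499 files, 328.96 GB total. Already received 1 files, 1.32 GB total
Receiving 506 files, 345.33 GB total. Already received 6 files, 24.19 MB total
Receiving 362 files, 206.73 GB total. Already received 7 files, 34 MB total
Receiving 424 files, 281.25 GB total. Already received 1 files, 1.3 GB total
Receiving 581 files, 349.26 GB total. Already received 8 files, 45.96 MB total
Receiving 443 files, 337.26 GB total. Already received 6 files, 96.15 MB total
Receiving 424 files, 275.23 GB total. Already received 5 files, 42.67 MB total
"""

# Hypothetical pattern for the summary line; adjust it if your output differs.
LINE_RE = re.compile(
    r"Receiving (\d+) files, ([\d.]+) (KB|MB|GB|TB) total\. "
    r"Already received (\d+) files, ([\d.]+) (KB|MB|GB|TB) total"
)

# Conversion factors to GB. Assumption: units are 1024-based.
TO_GB = {"KB": 1 / 1024**2, "MB": 1 / 1024, "GB": 1.0, "TB": 1024.0}


def summarize(text):
    """Total the expected and already-received files and sizes (in GB)."""
    totals = {
        "expected_files": 0,
        "expected_gb": 0.0,
        "received_files": 0,
        "received_gb": 0.0,
    }
    for m in LINE_RE.finditer(text):
        totals["expected_files"] += int(m.group(1))
        totals["expected_gb"] += float(m.group(2)) * TO_GB[m.group(3)]
        totals["received_files"] += int(m.group(4))
        totals["received_gb"] += float(m.group(5)) * TO_GB[m.group(6)]
    return totals


summary = summarize(NETSTATS)
print(
    f"expected: {summary['expected_files']} files, {summary['expected_gb']:.2f} GB; "
    f"received: {summary['received_files']} files, {summary['received_gb']:.2f} GB"
)
```

For the eight sessions quoted above, this totals 3,719 files and roughly 2,449 GB still expected, against 39 files and about 2.9 GB received, on a node that `nodetool status` already reports at 2.93 TB. That is consistent with the streams restarting from the beginning rather than resuming, although it does not by itself explain why.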