We're getting

    DEBUG [GossipStage:1] 2019-06-12 15:20:07,797 MigrationManager.java:96 - Not pulling schema because versions match or shouldPullSchemaFrom returned false

multiple times as it contacts the nodes. (Details on where the two settings
live and a sketch of our status poller are at the bottom of this message,
below the quoted thread.)

On Wed, Jun 12, 2019 at 11:35 AM Carl Mueller <carl.muel...@smartthings.com> wrote:

> We were only able to scale out four nodes before failures started
> occurring, including multiple instances of nodes joining the cluster
> without streaming.
>
> Sigh.
>
> On Tue, Jun 11, 2019 at 3:11 PM Carl Mueller <carl.muel...@smartthings.com>
> wrote:
>
>> We had a three-DC (asia-tokyo/europe/us) Cassandra 2.2.13 cluster on AWS,
>> using IPv6.
>>
>> We needed to scale out the asia datacenter, which was 5 nodes; europe and
>> us were 25 nodes each.
>>
>> We kept running into bootstrapping issues where the new node failed to
>> bootstrap/stream. It failed with
>>
>> "java.lang.RuntimeException: A node required to move the data
>> consistently is down"
>>
>> ...even though all nodes were up according to nodetool status just before
>> adding the node.
>>
>> First we increased phi_convict_threshold to 12, which did not help.
>>
>> CASSANDRA-12281 looked similar to what we were hitting, though I don't
>> think we hit that exact bug. Somewhere in that ticket someone wrote:
>>
>> "For us, the workaround is either deleting the data (then bootstrap
>> again), or increasing the ring_delay_ms. And the larger the cluster is,
>> the longer ring_delay_ms is needed. Based on our tests, for a 40 nodes
>> cluster, it requires ring_delay_ms to be >50seconds. For a 70 nodes
>> cluster, >100seconds. Default is 30seconds."
>>
>> Given the WAN nature of our DCs, we raised ring_delay_ms to 100 seconds
>> and the bootstrap finally worked.
>>
>> Side note:
>>
>> During the rolling restarts to apply phi_convict_threshold, we observed
>> quite a lot of status-map variance between nodes (we have a program that
>> polls every node in a datacenter or cluster for its view of gossipinfo
>> and statuses). AWS does appear to have variable networking, consistent
>> with the usual phi_convict_threshold advice; I'm not sure whether our
>> difficulties were typical in that regard and/or whether our IPv6 and
>> globally distributed datacenters were exacerbating factors.
>>
>> We could not reproduce this in loadtest, although loadtest is only eu and
>> us (but is IPv6).
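For anyone wanting to try the same workaround: phi_convict_threshold is a
cassandra.yaml setting, while ring_delay_ms is a JVM system property, so the
two live in different places. Roughly what we changed, sketched for a stock
2.2 package install (on 2.2 the property goes through cassandra-env.sh; the
exact file may differ in your setup):

    # cassandra.yaml -- failure detector sensitivity (default 8)
    phi_convict_threshold: 12

    # cassandra-env.sh -- ring delay is a startup system property, not a
    # yaml setting (default 30000 ms; we used 100000 ms = 100 seconds)
    JVM_OPTS="$JVM_OPTS -Dcassandra.ring_delay_ms=100000"

Both need a node restart to take effect, and as far as I know ring_delay_ms
only matters on the node that is bootstrapping.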
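Re the polling program mentioned in the side note: our internal tool is more
involved, but a minimal sketch of the idea looks like the following (Python
3.7+, assumes nodetool is on the PATH and each host is reachable via
"nodetool -h <host>"; the hostnames are placeholders):

    #!/usr/bin/env python3
    # Ask several nodes for their view of the ring and report peers whose
    # Up/Down state differs depending on which node you ask.
    import subprocess
    from collections import defaultdict

    NODES = ["cass-asia-1.example.com", "cass-eu-1.example.com"]  # placeholders

    def status_view(node):
        """Return {peer_address: state} as seen by one node's nodetool status."""
        out = subprocess.run(["nodetool", "-h", node, "status"],
                             capture_output=True, text=True, check=True).stdout
        view = {}
        for line in out.splitlines():
            parts = line.split()
            # Data rows start with a two-letter state code: UN, DN, UJ, ...
            if len(parts) > 1 and len(parts[0]) == 2 and parts[0][0] in "UD":
                view[parts[1]] = parts[0]
        return view

    views = {n: status_view(n) for n in NODES}
    states_by_peer = defaultdict(set)
    for node, view in views.items():
        for peer, state in view.items():
            states_by_peer[peer].add(state)
    for peer, states in sorted(states_by_peer.items()):
        if len(states) > 1:
            print("variance for %s: %s" % (peer, sorted(states)))

In our case the interesting part was exactly this disagreement: which nodes
each node considered up changed from poll to poll during the restarts.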