Hi Tom, I am not completely sure what caused your issue, but I want to highlight that schema agreement was overhauled substantially in 4.0 (1), so your problem may be related to what that ticket was trying to fix.
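Until then, it may help to check for schema agreement mechanically rather than by eye before restarting anything. Here is a minimal Python sketch that parses text captured from `nodetool describecluster`. It assumes the "Schema versions" section prints one `<uuid>: [addresses]` line per version, which is the layout I would expect on 3.11; the function names, sample UUIDs, and addresses are made up for illustration.

```python
import re

# Matches lines such as "<uuid>: [10.0.0.1, 10.0.0.2]" in the "Schema versions" section.
_VERSION_LINE = re.compile(
    r"^\s*([0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}):\s*\[(.*)\]\s*$"
)

# Illustrative output only; real output has more fields (snitch, partitioner, ...).
SAMPLE_OUTPUT = """\
Cluster Information:
    Name: Example Cluster
    Schema versions:
        11111111-2222-3333-4444-555555555555: [10.0.1.1, 10.0.1.2]

        66666666-7777-8888-9999-aaaaaaaaaaaa: [10.0.2.1]

        UNREACHABLE: [10.0.2.2]
"""


def schema_versions(describecluster_output):
    """Map each schema version UUID to the node addresses reporting it.

    The UNREACHABLE line is not a UUID, so unreachable nodes are skipped here;
    callers may want to handle them separately.
    """
    versions = {}
    for line in describecluster_output.splitlines():
        match = _VERSION_LINE.match(line)
        if match:
            version, hosts = match.groups()
            versions[version] = [h.strip() for h in hosts.split(",") if h.strip()]
    return versions


def in_agreement(versions):
    """True when the reachable nodes report exactly one schema version."""
    return len(versions) == 1
```

Calling `in_agreement(schema_versions(text))` in a loop, with `text` captured from `nodetool describecluster`, gives a simple "wait for agreement" gate before the next schema change or restart.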
Regards

(1) https://issues.apache.org/jira/browse/CASSANDRA-15158

On Fri, 1 Oct 2021 at 18:43, Tom Offermann <tofferm...@newrelic.com> wrote:
>
> When adding a datacenter to a keyspace (following the Last Pickle [Data
> Center Switch][lp] playbook), I ran into a "Configuration exception merging
> remote schema" error. The nodes in one datacenter didn't converge to the new
> schema version, and after restarting them, I saw the symptoms described in
> this Datastax article on [Fixing a table schema collision][ds], where there
> were two data directories for each table in the keyspace on the nodes that
> didn't converge. I followed the recovery steps in the Datastax article to
> move the data from the older directories to the new directories, ran
> `nodetool refresh`, and that fixed the problem.
>
> [lp]: https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
> [ds]: https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useCreateTableCollisionFix.html
>
> While the Datastax article was super helpful for helping me recover, I'm left
> wondering *why* this happened. If anyone can shed some light on that, or
> offer advice on how I can avoid getting in this situation in the future, I
> would be most appreciative. I'll describe the steps I took in more detail in
> the thread.
>
> ## Steps
>
> 1. The day before, I had added the second datacenter ('dc2') to the
>    system_traces, system_distributed, and system_auth keyspaces and ran
>    `nodetool rebuild` for each of the 3 keyspaces. All of that went smoothly
>    with no issues.
>
> 2. For a large keyspace, I added the second datacenter ('dc2') with an
>    `ALTER KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '3'};`
>    statement.
>    Immediately, I saw this error in the log:
> ```
> "ERROR 16:45:47 Exception in thread Thread[MigrationStage:1,5,main]"
> "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
> "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:594) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]"
> "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]"
> "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]"
> "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
> "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
> ```
>
> I also saw this:
> ```
> "ERROR 16:46:48 Configuration exception merging remote schema"
> "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
> "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:91) ~[apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53) [apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) [apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_232]"
> "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_232]"
> "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]"
> "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
> "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
> "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
> ```
> This error repeated several times over the next 2 minutes.
>
> 3. While running `nodetool describecluster` repeatedly, I saw that the nodes
>    in the 'dc2' datacenter converged to the new schema version quickly, but
>    the nodes in the original datacenter ('dc1') remained at the previous
>    schema version.
>
> 4. I waited to see if all of the nodes would converge to the new schema
>    version, but they still hadn't converged after roughly 10 minutes. Given
>    the errors I saw, I wasn't optimistic it would work out all by itself, so
>    I decided to restart the nodes in the 'dc1' datacenter one at a time so
>    they would restart with the latest schema version.
>
> 5. After each node restarted, `nodetool describecluster` showed it as being
>    on the latest schema version. So, after getting through all the 'dc1'
>    nodes, it looked like everything in the cluster was healthy again.
>
> 6. However, that's when I noticed that there were two data directories on
>    disk for each table in the keyspace. New writes for a table were being
>    saved in the newer directory, but queries for data saved in the old data
>    directory were returning no results.
>
> 7. That's when I followed the recovery steps in the Datastax article with
>    great success.
>
> ## Questions
>
> * My understanding is that running concurrent schema updates should always
>   be avoided, since that can result in schema collisions. But, in this case,
>   I wasn't performing multiple schema updates. I was just running a single
>   `ALTER KEYSPACE` statement. Any idea why a single schema update would
>   result in a schema collision and two data directories per table?
>
> * Should I have waited longer before restarting nodes? Perhaps, given enough
>   time, the Cassandra nodes would have all converged on the correct schema
>   version, and this would have resolved on its own?
>
> * Any suggestions for how I can avoid this problem in the future?
>
> --
> Tom Offermann
> Lead Software Engineer
> http://newrelic.com
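
On the "two data directories per table" symptom from step 6: Cassandra names each table's data directory `<table>-<table id without dashes>`, so a collision like the one in the log shows up as two directories sharing a table name. A rough Python sketch for spotting that is below. The function name and the table names `bar` and `baz` are hypothetical; the two ids for `bar` are the ones from the "Column family ID mismatch" message above, and the id for `baz` is made up.

```python
import re
from collections import defaultdict

# "<table>-<32 hex chars>": the table id is the table UUID with the dashes removed.
_TABLE_DIR = re.compile(r"^(?P<table>.+)-(?P<table_id>[0-9a-f]{32})$")


def duplicate_table_dirs(dir_names):
    """Return {table: [table ids]} for tables that have more than one data directory."""
    by_table = defaultdict(list)
    for name in dir_names:
        match = _TABLE_DIR.match(name)
        if match:
            by_table[match.group("table")].append(match.group("table_id"))
    return {table: sorted(ids) for table, ids in by_table.items() if len(ids) > 1}


# Hypothetical directory listing for keyspace "foo".
EXAMPLE_DIRS = [
    "bar-20739eb0d92e11e6b42fe7eb6f21c481",
    "bar-8ad72660f62911eba217e1a09d8bc60c",
    "baz-0123456789abcdef0123456789abcdef",
]
```

In practice the input would be something like `os.listdir()` on the keyspace directory under whatever `data_file_directories` points to in `cassandra.yaml`; the exact path depends on the installation. A non-empty result on any node suggests that node created tables under a second id and is worth investigating before more schema changes.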