I would not expect an ALTER KEYSPACE to introduce a divergent CFID; that usually happens during a CREATE TABLE. With no other evidence or ability to debug, my guess is that the CFIDs had already diverged earlier, but because of the race(s) I described, the on-disk schema and the in-memory schema differed. The ALTER KEYSPACE then caused the schema from one host to be serialized and pushed to the others, which is when the IDs actually got reconciled.
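If it helps, here's a rough sketch of how I'd check for that kind of divergence on one host. It assumes the default data directory layout (each table directory is named `<table>-<id without dashes>`); the keyspace/table names and path are just placeholders:

```
# Sketch only: compare the id the live schema reports for a table with the
# id embedded in the on-disk directory name(s). A second directory with a
# different id suggests the CFIDs had already diverged.
KS=foo                              # placeholder keyspace
TBL=bar                             # placeholder table
DATA_DIR=/var/lib/cassandra/data    # adjust to your data_file_directories

echo "id in system_schema.tables:"
cqlsh -e "SELECT id FROM system_schema.tables WHERE keyspace_name='$KS' AND table_name='$TBL';"

echo "directories on disk (id is the suffix, without dashes):"
ls -ld "$DATA_DIR/$KS/$TBL"-*
```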
You may be able to confirm/demonstrate that by looking at the timestamps on the data directories across all of the hosts in the cluster?

On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann <tofferm...@newrelic.com> wrote:

> Jeff,
>
> Thanks for describing the race condition.
>
> I understand that performing concurrent schema changes is dangerous, and that running an `ALTER KEYSPACE` on one node, and then running another `ALTER KEYSPACE` on a different node before the first has fully propagated throughout the cluster, can lead to schema collisions.
>
> But can running a single `ALTER KEYSPACE` on a single node also be vulnerable to this race condition?
>
> We were careful to make sure that all nodes in both datacenters were on the same schema version ID by checking the output of `nodetool describecluster`. Since all nodes were in agreement, I figured that I had ruled out the possibility of concurrent schema changes.
>
> As I mentioned, on the day before we did run 3 different `ALTER KEYSPACE` schema changes (to add 'dc2' to system_traces, system_distributed, and system_auth) and also ran `nodetool rebuild` for each of the 3 keyspaces. Is it possible that one or more of these schema changes hadn't fully propagated 24 hours later, even though `nodetool describecluster` showed all nodes as being on the same schema version? Is there a better way to determine that I am not inadvertently issuing concurrent schema changes?
>
> I'm also curious about how CFIDs are generated and when new ones are generated. What I've noticed is that when I successfully run `ALTER KEYSPACE` to add a datacenter with no errors (and make no other schema changes), the table IDs in `system_schema.tables` remain unchanged. But when we saw the schema collision that I described in this thread, it resulted in new table IDs in `system_schema.tables`. Why do these table IDs normally remain unchanged? What caused new ones to be generated in the error case I described?
>
> --Tom
>
> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> I've described this race a few times on the list. It is very, very dangerous to do concurrent table creation in Cassandra with non-deterministic CFIDs.
>>
>> I'll try to describe it quickly right now:
>> - Imagine you have 3 hosts, A, B, and C.
>>
>> You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
>> A allocates a CFID (which is a UUID that includes a high-resolution timestamp) and starts adjusting its schema.
>> Before it can finish that schema change, you connect to B and issue the same CREATE TABLE statement.
>> B allocates a DIFFERENT CFID, and starts adjusting its schema.
>>
>> A and B both have a CFID, which they will use to make a data directory on disk, and which they will push/pull to the rest of the cluster through schema propagation.
>>
>> The later CFID will be saved in the schema, because the schema is a normal Cassandra table with last-write-wins semantics, but the first CFID might be the one that's used to create the data directory on disk, and it may have all of your data in it while you write to the table.
>>
>> In some cases, you'll get CFID mismatch errors on reads or writes, as the CFID in memory varies between instances.
>> In other cases, things work fine until you restart, at which time the CFID for the table changes when you load the new schema, and data on disk isn't found.
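A minimal sketch of the kind of guard that avoids the window described above: funnel all DDL through one coordinator and block until `nodetool describecluster` reports a single schema version before issuing the next statement. The host name and keyspace here are illustrative, and the grep assumes the 3.11-style output where each schema version shows up as `<uuid>: [ip, ...]`:

```
# Sketch only: run one DDL statement, then poll until the whole cluster
# reports exactly one schema version before running anything else.
run_ddl_and_wait() {
  local stmt="$1"
  cqlsh cass-node1 -e "$stmt"    # cass-node1 is an illustrative coordinator
  while [ "$(nodetool describecluster | grep -c ': \[')" -ne 1 ]; do
    sleep 2
  done
}

run_ddl_and_wait "ALTER KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '3'};"
```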
>> This race, unfortunately, can even occur on a single node in SOME versions of Cassandra (but not all).
>>
>> This is a really, really, really bad race in many old versions of Cassandra, and a lot of the schema redesign in 4.0 is meant to solve many of these types of problems.
>>
>> That this continues to be possible in old versions is scary; people running old versions should not do concurrent schema changes (especially those that CREATE tables). Alternatively, you should alert if the CFID in memory doesn't match the CFID in the disk path. One could also change Cassandra to use deterministic CFIDs to avoid the race entirely (though deterministic CFIDs have a different problem, which is that DROP + re-CREATE with any host down potentially allows data on that down host to come back when the host comes back online).
>>
>> Stronger cluster metadata starts making this much safer, so I'm looking forward to seeing that in future releases.
>>
>> On Wed, Oct 13, 2021 at 10:23 AM vytenis silgalis <vsilga...@gmail.com> wrote:
>>
>>> Did you run the `alter keyspace` command on the original dc1 nodes or the new dc2 nodes?
>>>
>>> On Wed, Oct 13, 2021 at 8:15 AM Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:
>>>
>>>> Hi Tom,
>>>>
>>>> While I am not completely sure what might cause your issue, I just want to highlight that schema agreement was overhauled a lot in 4.0 (1), so that may be related to what that ticket was trying to fix.
>>>>
>>>> Regards
>>>>
>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-15158
>>>>
>>>> On Fri, 1 Oct 2021 at 18:43, Tom Offermann <tofferm...@newrelic.com> wrote:
>>>> >
>>>> > When adding a datacenter to a keyspace (following the Last Pickle [Data Center Switch][lp] playbook), I ran into a "Configuration exception merging remote schema" error. The nodes in one datacenter didn't converge to the new schema version, and after restarting them, I saw the symptoms described in the Datastax article on [Fixing a table schema collision][ds]: there were two data directories for each table in the keyspace on the nodes that didn't converge. I followed the recovery steps in the Datastax article to move the data from the older directories to the new directories, ran `nodetool refresh`, and that fixed the problem.
>>>> >
>>>> > [lp]: https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>>>> > [ds]: https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useCreateTableCollisionFix.html
>>>> >
>>>> > While the Datastax article was super helpful for recovering, I'm left wondering *why* this happened. If anyone can shed some light on that, or offer advice on how I can avoid getting into this situation in the future, I would be most appreciative. I'll describe the steps I took in more detail in the thread.
>>>> >
>>>> > ## Steps
>>>> >
>>>> > 1. The day before, I had added the second datacenter ('dc2') to the system_traces, system_distributed, and system_auth keyspaces and ran `nodetool rebuild` for each of the 3 keyspaces. All of that went smoothly with no issues.
>>>> >
>>>> > 2. For a large keyspace, I added the second datacenter ('dc2') with an `ALTER KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '3'};` statement.
>>>> > Immediately, I saw this error in the log:
>>>> > ```
>>>> > "ERROR 16:45:47 Exception in thread Thread[MigrationStage:1,5,main]"
>>>> > "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>> > "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:594) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]"
>>>> > "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]"
>>>> > "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]"
>>>> > "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
>>>> > "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>> > ```
>>>> >
>>>> > I also saw this:
>>>> > ```
>>>> > "ERROR 16:46:48 Configuration exception merging remote schema"
>>>> > "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>> > "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:91) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_232]"
>>>> > "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_232]"
>>>> > "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]"
>>>> > "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
>>>> > "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>> > "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>> > ```
>>>> >
>>>> > This error repeated several times over the next 2 minutes.
>>>> >
>>>> > 3. While running `nodetool describecluster` repeatedly, I saw that the nodes in the 'dc2' datacenter converged to the new schema version quickly, but the nodes in the original datacenter ('dc1') remained at the previous schema version.
>>>> >
>>>> > 4. I waited to see if all of the nodes would converge to the new schema version, but they still hadn't converged after roughly 10 minutes. Given the errors I saw, I wasn't optimistic it would work out all by itself, so I decided to restart the nodes in the 'dc1' datacenter one at a time so they would restart with the latest schema version.
>>>> >
>>>> > 5. After each node restarted, `nodetool describecluster` showed it as being on the latest schema version. So, after getting through all the 'dc1' nodes, it looked like everything in the cluster was healthy again.
>>>> >
>>>> > 6. However, that's when I noticed that there were two data directories on disk for each table in the keyspace. New writes for a table were being saved in the newer directory, but queries for data saved in the old data directory were returning no results.
>>>> >
>>>> > 7. That's when I followed the recovery steps in the Datastax article, with great success.
>>>> >
>>>> > ## Questions
>>>> >
>>>> > * My understanding is that running concurrent schema updates should always be avoided, since that can result in schema collisions. But in this case, I wasn't performing multiple schema updates. I was just running a single `ALTER KEYSPACE` statement. Any idea why a single schema update would result in a schema collision and two data directories per table?
>>>> >
>>>> > * Should I have waited longer before restarting nodes? Perhaps, given enough time, the Cassandra nodes would have all converged on the correct schema version, and this would have resolved on its own?
>>>> >
>>>> > * Any suggestions for how I can avoid this problem in the future?
>>>> >
>>>> > --
>>>> > Tom Offermann
>>>> > Lead Software Engineer
>>>> > http://newrelic.com
>>>
>
>
> --
> Tom Offermann
> Lead Software Engineer
> http://newrelic.com
>
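For reference, a rough sketch of the recovery Tom describes in step 7 (per node: move the SSTables from the stale-CFID directory into the directory that matches the id in `system_schema.tables`, then `nodetool refresh`). The keyspace, table, path, and directory ids below are all placeholders; see the Datastax article linked above for the full procedure:

```
# Sketch only, run on each affected node.
KS=foo
TBL=bar
DATA_DIR=/var/lib/cassandra/data
OLD_ID=11111111111111111111111111111111   # placeholder: suffix of the stale directory
NEW_ID=22222222222222222222222222222222   # placeholder: id reported by system_schema.tables

# Move the SSTable files (leave any snapshots/ or backups/ subdirectories behind).
find "$DATA_DIR/$KS/$TBL-$OLD_ID" -maxdepth 1 -type f \
  -exec mv {} "$DATA_DIR/$KS/$TBL-$NEW_ID"/ \;

# Tell Cassandra to pick up the newly placed SSTables.
nodetool refresh "$KS" "$TBL"
```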