Consistency doesn't matter for schema. For every host: "select id from system_schema.tables WHERE keyspace_name=? and table_name=?" ( https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/schema/SchemaKeyspace.java#L144 )

Then, compare that to the /path/to/data/keyspace/table-(id)/ directory on disk. If any of those don't match, you've got a problem waiting to bite you on the next restart.
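To make that two-step check concrete, here is a minimal sketch meant to be run locally on each node. It assumes the Python cassandra-driver package, a node listening on 127.0.0.1, and the default /var/lib/cassandra/data layout; the keyspace, table, and data path are placeholders to adjust for your cluster:

```
# Minimal sketch: compare the table id in system_schema.tables with the
# table's data directory name on disk. Run locally on each node.
# Assumes: Python cassandra-driver, node on 127.0.0.1, default data path.
import os
import sys

from cassandra.cluster import Cluster

DATA_DIR = "/var/lib/cassandra/data"  # adjust to your data_file_directories


def check(keyspace, table):
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    row = session.execute(
        "SELECT id FROM system_schema.tables "
        "WHERE keyspace_name = %s AND table_name = %s",
        (keyspace, table),
    ).one()
    cluster.shutdown()
    if row is None:
        print(f"{keyspace}.{table}: not found in system_schema.tables")
        return
    schema_id = row.id.hex  # directory suffix is the table id without dashes
    ks_dir = os.path.join(DATA_DIR, keyspace)
    table_dirs = [d for d in os.listdir(ks_dir) if d.startswith(table + "-")]
    if not table_dirs:
        print(f"{keyspace}.{table}: no data directory under {ks_dir}")
    for d in table_dirs:
        disk_id = d.split("-", 1)[1]
        status = "OK" if disk_id == schema_id else "MISMATCH"
        print(f"{keyspace}.{table}: schema id {schema_id}, on disk {d} -> {status}")


if __name__ == "__main__":
    check(sys.argv[1], sys.argv[2])
```

Seeing more than one directory for a table, or a directory whose suffix doesn't match the id in system_schema.tables, is exactly the "problem waiting to bite you" described above.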
On Fri, Oct 15, 2021 at 3:48 PM Tom Offermann <tofferm...@newrelic.com> wrote:

> So, if I were to do `CONSISTENCY ALL; select *` from each of the system_schema tables, then on-disk and in-memory should be in sync?
>
> On Fri, Oct 15, 2021 at 3:38 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Heap dumps + filesystem inspection + SELECT from schema tables.
>>
>> On Fri, Oct 15, 2021 at 3:28 PM Tom Offermann <tofferm...@newrelic.com> wrote:
>>
>>> Interesting!
>>>
>>> Is there a way to determine if the on-disk schema and the in-memory schema are in sync? Is there a way to force them to sync? If so, would it help to force a sync before running an `ALTER KEYSPACE` schema change?
>>>
>>> On Fri, Oct 15, 2021 at 3:08 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> I would not expect an ALTER KEYSPACE to introduce a divergent CFID; that usually happens during a CREATE TABLE. With no other evidence or ability to debug, I would guess that the CFIDs diverged previously, but due to the race(s) I described, the on-disk schema and the in-memory schema differed, and the ALTER KEYSPACE forces the schema from one host to be serialized and pushed to the others, where the actual IDs get reconciled.
>>>>
>>>> You may be able to confirm/demonstrate that by looking at the timestamps on the data directories across all of the hosts in the cluster?
>>>>
>>>> On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann <tofferm...@newrelic.com> wrote:
>>>>
>>>>> Jeff,
>>>>>
>>>>> Thanks for describing the race condition.
>>>>>
>>>>> I understand that performing concurrent schema changes is dangerous, and that running an `ALTER KEYSPACE` on one node, and then running another `ALTER KEYSPACE` on a different node, before the first has fully propagated throughout the cluster, can lead to schema collisions.
>>>>>
>>>>> But, can running a single `ALTER KEYSPACE` on a single node also be vulnerable to this race condition?
>>>>>
>>>>> We were careful to make sure that all nodes in both datacenters were on the same schema version ID by checking the output of `nodetool describecluster`. Since all nodes were in agreement, I figured that I had ruled out the possibility of concurrent schema changes.
>>>>>
>>>>> As I mentioned, on the day before, we did run 3 different `ALTER KEYSPACE` schema changes (to add 'dc2' to system_traces, system_distributed, and system_auth) and also ran `nodetool rebuild` for each of the 3 keyspaces. Is it possible that one or more of these schema changes hadn't fully propagated 24 hours later, even though `nodetool describecluster` showed all nodes as being on the same schema version? Is there a better way to determine that I am not inadvertently issuing concurrent schema changes?
>>>>>
>>>>> I'm also curious about how CFIDs are generated and when new ones are generated. What I've noticed is that when I successfully run `ALTER KEYSPACE` to add a datacenter with no errors (and make no other schema changes), the table IDs in `system_schema.tables` remain unchanged. But the schema collision that I described in this thread resulted in new table IDs in `system_schema.tables`. Why do these table IDs normally remain unchanged? What caused new ones to be generated in the error case I described?
>>>>>
>>>>> --Tom
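On the question above of how to be sure the cluster has settled on one schema version before issuing the next change: besides `nodetool describecluster`, the same information can be read from a client by comparing `schema_version` in `system.local` and `system.peers` (this reflects the gossip view of whichever node you connect to). A minimal sketch, assuming the Python cassandra-driver and a reachable contact point (the address is a placeholder):

```
# Minimal sketch: report schema versions as seen by the node you connect to,
# similar in spirit to `nodetool describecluster`. Assumes cassandra-driver.
from collections import defaultdict

from cassandra.cluster import Cluster


def schema_versions(contact_points):
    cluster = Cluster(contact_points)
    session = cluster.connect()
    versions = defaultdict(set)
    local = session.execute("SELECT schema_version FROM system.local").one()
    versions[local.schema_version].add("(local)")
    for peer in session.execute("SELECT peer, schema_version FROM system.peers"):
        versions[peer.schema_version].add(str(peer.peer))
    cluster.shutdown()
    return versions


if __name__ == "__main__":
    versions = schema_versions(["10.0.0.1"])  # placeholder contact point
    for version, nodes in versions.items():
        print(version, sorted(nodes))
    if len(versions) > 1:
        print("WARNING: schema disagreement -- hold off on further DDL")
```

As the rest of the thread shows, though, agreement on a version number doesn't guarantee that the in-memory schema matches what's on disk on every node, which is why the per-node directory check sketched earlier is still worth running.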
>>>>> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>
>>>>>> I've described this race a few times on the list. It is very very dangerous to do concurrent table creation in cassandra with non-deterministic CFIDs.
>>>>>>
>>>>>> I'll try to describe it quickly right now:
>>>>>>
>>>>>> - Imagine you have 3 hosts, A, B and C.
>>>>>>
>>>>>> You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
>>>>>> A allocates a CFID (which is a UUID, which includes a high-resolution timestamp) and starts adjusting its schema.
>>>>>> Before it can finish that schema, you connect to B and issue the same CREATE TABLE statement.
>>>>>> B allocates a DIFFERENT CFID, and starts adjusting its schema.
>>>>>>
>>>>>> A and B both have a CFID, which they will use to make a data directory on disk, and which they will push/pull to the rest of the cluster through schema propagation.
>>>>>>
>>>>>> The later CFID will be saved in the schema, because the schema is a normal cassandra table with last-write-wins semantics, but the first CFID might be the one that's used to create the data directory on disk, and it may have all of your data in it while you write to the table.
>>>>>>
>>>>>> In some cases, you'll get CFID mismatch errors on reads or writes, as the CFID in memory varies between instances. In other cases, things work fine until you restart, at which time the CFID for the table changes when you load the new schema, and data on disk isn't found.
>>>>>>
>>>>>> This race, unfortunately, can even occur on a single node in SOME versions of Cassandra (but not all).
>>>>>>
>>>>>> This is a really really really bad race in many old versions of cassandra, and a lot of the schema redesign in 4.0 is meant to solve many of these types of problems.
>>>>>>
>>>>>> That this continues to be possible in old versions is scary; people running old versions should not do concurrent schema changes (especially those that CREATE tables). Alternatively, you should alert if the CFID in memory doesn't match the CFID in the disk path. One could also change cassandra to use deterministic CFIDs to avoid the race entirely (though deterministic CFIDs have a different problem, which is that DROP + re-CREATE with any host down potentially allows data on that down host to come back when the host comes back online).
>>>>>>
>>>>>> Stronger cluster metadata starts making this much safer, so looking forward to seeing that in future releases.
>>>>>>
>>>>>> On Wed, Oct 13, 2021 at 10:23 AM vytenis silgalis <vsilga...@gmail.com> wrote:
>>>>>>
>>>>>>> You ran the `alter keyspace` command on the original dc1 nodes or the new dc2 nodes?
>>>>>>>
>>>>>>> On Wed, Oct 13, 2021 at 8:15 AM Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:
>>>>>>>
>>>>>>>> Hi Tom,
>>>>>>>>
>>>>>>>> While I am not completely sure what might cause your issue, I just want to highlight that schema agreements were overhauled a lot in 4.0 (1), so that may be somehow related to what that ticket was trying to fix.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-15158
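To make the CFID race Jeff describes above concrete, here is a toy illustration in plain Python (not Cassandra code): each coordinator allocates its own time-based UUID for the same CREATE TABLE, last-write-wins keeps the later one in the schema tables, and a data directory created with the earlier one no longer matches.

```
# Toy illustration of the CFID race described earlier in the thread.
# Not Cassandra code -- it only shows why two coordinators handling the
# "same" CREATE TABLE ... IF NOT EXISTS end up with different time-based ids.
import uuid

table = "foo"

cfid_on_a = uuid.uuid1()  # node A allocates a CFID and starts applying it
cfid_on_b = uuid.uuid1()  # node B handles the "same" statement moments later

# The schema tables are ordinary last-write-wins tables: the later id wins.
schema_cfid = max((cfid_on_a, cfid_on_b), key=lambda u: u.time)

# But node A may already have created its data directory with its own id.
data_dir = f"{table}-{cfid_on_a.hex}"

print(f"schema says {schema_cfid}, data directory is {data_dir}")
print("MISMATCH" if schema_cfid != cfid_on_a else "happened to match")
```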
>>>>>>>> On Fri, 1 Oct 2021 at 18:43, Tom Offermann <tofferm...@newrelic.com> wrote:
>>>>>>>> >
>>>>>>>> > When adding a datacenter to a keyspace (following the Last Pickle [Data Center Switch][lp] playbook), I ran into a "Configuration exception merging remote schema" error. The nodes in one datacenter didn't converge to the new schema version, and after restarting them, I saw the symptoms described in this Datastax article on [Fixing a table schema collision][ds], where there were two data directories for each table in the keyspace on the nodes that didn't converge. I followed the recovery steps in the Datastax article to move the data from the older directories to the new directories, ran `nodetool refresh`, and that fixed the problem.
>>>>>>>> >
>>>>>>>> > [lp]: https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>>>>>>>> > [ds]: https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useCreateTableCollisionFix.html
>>>>>>>> >
>>>>>>>> > While the Datastax article was super helpful for the recovery, I'm left wondering *why* this happened. If anyone can shed some light on that, or offer advice on how I can avoid getting in this situation in the future, I would be most appreciative. I'll describe the steps I took in more detail in the thread.
>>>>>>>> >
>>>>>>>> > ## Steps
>>>>>>>> >
>>>>>>>> > 1. The day before, I had added the second datacenter ('dc2') to the system_traces, system_distributed, and system_auth keyspaces and ran `nodetool rebuild` for each of the 3 keyspaces. All of that went smoothly with no issues.
>>>>>>>> >
>>>>>>>> > 2. For a large keyspace, I added the second datacenter ('dc2') with an `ALTER KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '3'};` statement.
>>>>>>>> > Immediately, I saw this error in the log:
>>>>>>>> >
>>>>>>>> > ```
>>>>>>>> > "ERROR 16:45:47 Exception in thread Thread[MigrationStage:1,5,main]"
>>>>>>>> > "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>>>>> > "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:594) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
>>>>>>>> > "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>>>>>> > ```
>>>>>>>> >
>>>>>>>> > I also saw this:
>>>>>>>> >
>>>>>>>> > ```
>>>>>>>> > "ERROR 16:46:48 Configuration exception merging remote schema"
>>>>>>>> > "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>>>>> > "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:91) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
>>>>>>>> > "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>>>>>> > ```
>>>>>>>> >
>>>>>>>> > This error repeated several times over the next 2 minutes.
>>>>>>>> >
>>>>>>>> > 3. While running `nodetool describecluster` repeatedly, I saw that the nodes in the 'dc2' datacenter converged to the new schema version quickly, but the nodes in the original datacenter ('dc1') remained at the previous schema version.
>>>>>>>> >
>>>>>>>> > 4. I waited to see if all of the nodes would converge to the new schema version, but they still hadn't converged after roughly 10 minutes. Given the errors I saw, I wasn't optimistic it would work out all by itself, so I decided to restart the nodes in the 'dc1' datacenter one at a time so they would restart with the latest schema version.
>>>>>>>> >
>>>>>>>> > 5. After each node restarted, `nodetool describecluster` showed it as being on the latest schema version. So, after getting through all the 'dc1' nodes, it looked like everything in the cluster was healthy again.
>>>>>>>> >
>>>>>>>> > 6. However, that's when I noticed that there were two data directories on disk for each table in the keyspace. New writes for a table were being saved in the newer directory, but queries for data saved in the old data directory were returning no results.
>>>>>>>> >
>>>>>>>> > 7. That's when I followed the recovery steps in the Datastax article with great success.
>>>>>>>> >
>>>>>>>> > ## Questions
>>>>>>>> >
>>>>>>>> > * My understanding is that running concurrent schema updates should always be avoided, since that can result in schema collisions. But, in this case, I wasn't performing multiple schema updates. I was just running a single `ALTER KEYSPACE` statement. Any idea why a single schema update would result in a schema collision and two data directories per table?
>>>>>>>> >
>>>>>>>> > * Should I have waited longer before restarting nodes? Perhaps, given enough time, the Cassandra nodes would have all converged on the correct schema version, and this would have resolved on its own?
>>>>>>>> >
>>>>>>>> > * Any suggestions for how I can avoid this problem in the future?
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Tom Offermann
>>>>>>>> > Lead Software Engineer
>>>>>>>> > http://newrelic.com
>>>>>
>>>>> --
>>>>> Tom Offermann
>>>>> Lead Software Engineer
>>>>> http://newrelic.com
>>>
>>> --
>>> Tom Offermann
>>> Lead Software Engineer
>>> http://newrelic.com
>
> --
> Tom Offermann
> Lead Software Engineer
> http://newrelic.com
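For completeness, here is a rough, print-only sketch of the recovery flow described in the thread: find the table directory matching the current id in system_schema.tables, plan to move the SSTables from any older, orphaned directory into it, then run `nodetool refresh`. It only prints what it would do; the DataStax article linked above remains the authoritative procedure, and the data path and id argument are placeholders.

```
# Print-only sketch of the recovery flow described in the thread.
# Follow the DataStax article for the real procedure; take a snapshot first.
import os
import sys

DATA_DIR = "/var/lib/cassandra/data"  # adjust to your data_file_directories


def plan_recovery(keyspace, table, current_id_hex):
    ks_dir = os.path.join(DATA_DIR, keyspace)
    current = f"{table}-{current_id_hex}"  # directory matching the schema id
    for d in sorted(os.listdir(ks_dir)):
        if not d.startswith(table + "-") or d == current:
            continue
        old_dir = os.path.join(ks_dir, d)
        for name in sorted(os.listdir(old_dir)):
            if name in ("backups", "snapshots"):
                continue
            print(f"would move {os.path.join(old_dir, name)} "
                  f"-> {os.path.join(ks_dir, current)}")
    print(f"then run: nodetool refresh {keyspace} {table}")


if __name__ == "__main__":
    # current_id_hex is the id from system_schema.tables, without dashes
    plan_recovery(sys.argv[1], sys.argv[2], sys.argv[3])
```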