Jeff, Ahh...I see. That makes sense. I'll add this to the list of things to check before making a schema change.
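For my own notes, here is a rough sketch of that check (compare the `id` from `system_schema.tables` with the `table-(id)` directory on disk). It assumes the table IDs for one keyspace have already been fetched on the host being checked, for example with cqlsh, and that 3.x names each data directory `<table>-<id as 32 hex characters, no dashes>`. The function name `find_cfid_mismatches` and the table name used below are hypothetical.

```python
import re
import uuid
from pathlib import Path
from typing import Dict, List


def find_cfid_mismatches(keyspace_dir: Path,
                         schema_ids: Dict[str, uuid.UUID]) -> Dict[str, List[str]]:
    """Return {table: [directories found on disk]} for every table whose
    on-disk directories are not exactly the single one the schema expects.

    keyspace_dir: e.g. /path/to/data/<keyspace> on the host being checked.
    schema_ids:   table name -> id, as read from system_schema.tables.
    """
    problems: Dict[str, List[str]] = {}
    for table, table_id in schema_ids.items():
        # Assumed directory layout: <table>-<uuid hex without dashes>
        expected = f"{table}-{table_id.hex}"
        pattern = re.compile(rf"^{re.escape(table)}-[0-9a-f]{{32}}$")
        on_disk = sorted(
            p.name for p in keyspace_dir.iterdir()
            if p.is_dir() and pattern.match(p.name)
        )
        if on_disk != [expected]:
            problems[table] = on_disk
    return problems


# Hypothetical usage on one host:
# ids = {"events": uuid.UUID("20739eb0-d92e-11e6-b42f-e7eb6f21c481")}
# print(find_cfid_mismatches(Path("/path/to/data/foo"), ids))
```

An empty result means every table has exactly the directory its schema ID implies. An extra directory is not always a collision (an old directory can be left behind after a table is dropped and re-created), so any hit is a prompt to look closer rather than proof of a problem.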
Thanks so much for taking the time to walk me through this. Really appreciate all of your help!

On Fri, Oct 15, 2021 at 3:52 PM Jeff Jirsa <jji...@gmail.com> wrote:

> Consistency doesn't matter for schema.
>
> For every host: "select id from system_schema.tables WHERE
> keyspace_name=? and table_name=?"
> (https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/schema/SchemaKeyspace.java#L144)
>
> Then, compare that to the /path/to/data/keyspace/table-(id)/ on disk
>
> If any of those don't match, you've got a problem waiting to bite you on
> next restart.
>
> On Fri, Oct 15, 2021 at 3:48 PM Tom Offermann <tofferm...@newrelic.com>
> wrote:
>
>> So, if I were to do `CONSISTENCY ALL; select *` from each of the
>> system_schema tables, then on-disk and in-memory should be in sync?
>>
>> On Fri, Oct 15, 2021 at 3:38 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>> Heap dumps + filesystem inspection + SELECT from schema tables.
>>>
>>> On Fri, Oct 15, 2021 at 3:28 PM Tom Offermann <tofferm...@newrelic.com>
>>> wrote:
>>>
>>>> Interesting!
>>>>
>>>> Is there a way to determine if the on-disk schema and the in-memory
>>>> schema are in sync? Is there a way to force them to sync? If so, would it
>>>> help to force a sync before running an `ALTER KEYSPACE` schema change?
>>>>
>>>> On Fri, Oct 15, 2021 at 3:08 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>
>>>>> I would not expect an ALTER KEYSPACE to introduce a divergent CFID;
>>>>> that usually happens during a CREATE TABLE. With no other evidence or
>>>>> ability to debug, I would guess that the CFIDs diverged previously, but
>>>>> due to the race(s) I described, the on-disk schema and the in-memory
>>>>> schema differed, and the ALTER KEYSPACE forces the schema from one host
>>>>> to be serialized and forced to the others, where the actual IDs get
>>>>> reconciled.
>>>>>
>>>>> You may be able to confirm/demonstrate that by looking at the
>>>>> timestamps on the data directories across all of the hosts in the cluster?
>>>>>
>>>>> On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann <tofferm...@newrelic.com>
>>>>> wrote:
>>>>>
>>>>>> Jeff,
>>>>>>
>>>>>> Thanks for describing the race condition.
>>>>>>
>>>>>> I understand that performing concurrent schema changes is dangerous,
>>>>>> and that running an `ALTER KEYSPACE` on one node, and then running
>>>>>> another `ALTER KEYSPACE` on a different node, before the first has fully
>>>>>> propagated throughout the cluster, can lead to schema collisions.
>>>>>>
>>>>>> But, can running a single `ALTER KEYSPACE` on a single node also be
>>>>>> vulnerable to this race condition?
>>>>>>
>>>>>> We were careful to make sure that all nodes in both datacenters were
>>>>>> on the same schema version ID by checking the output of `nodetool
>>>>>> describecluster`. Since all nodes were in agreement, I figured that I had
>>>>>> ruled out the possibility of concurrent schema changes.
>>>>>>
>>>>>> As I mentioned, on the day before, we did run 3 different `ALTER
>>>>>> KEYSPACE` schema changes (to add 'dc2' to system_traces,
>>>>>> system_distributed, and system_auth) and also ran `nodetool rebuild` for
>>>>>> each of the 3 keyspaces. Is it possible that one or more of these schema
>>>>>> changes hadn't fully propagated 24 hours later, even though `nodetool
>>>>>> describecluster` showed all nodes as being on the same schema version? Is
>>>>>> there a better way to determine that I am not inadvertently issuing
>>>>>> concurrent schema changes?
>>>>>>
>>>>>> I'm also curious about how CFIDs are generated and when new ones are
>>>>>> generated. What I've noticed is that when I successfully run `ALTER
>>>>>> KEYSPACE` to add a datacenter with no errors (and make no other schema
>>>>>> changes), then the table IDs in `system_schema.tables` remain unchanged.
>>>>>> But, when we saw the schema collision that I described in this thread,
>>>>>> that resulted in new table IDs in `system_schema.tables`. Why do these
>>>>>> table IDs normally remain unchanged? What caused new ones to be
>>>>>> generated in the error case I described?
>>>>>>
>>>>>> --Tom
>>>>>>
>>>>>> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>>
>>>>>>> I've described this race a few times on the list. It is very very
>>>>>>> dangerous to do concurrent table creation in cassandra with
>>>>>>> non-deterministic CFIDs.
>>>>>>>
>>>>>>> I'll try to describe it quickly right now:
>>>>>>> - Imagine you have 3 hosts, A B and C
>>>>>>>
>>>>>>> You connect to A and issue a "CREATE TABLE IF NOT EXISTS ...".
>>>>>>> A allocates a CFID (which is a UUID, which includes a high
>>>>>>> resolution timestamp), starts adjusting its schema
>>>>>>> Before it can finish that schema, you connect to B and issue the
>>>>>>> same CREATE TABLE statement
>>>>>>> B allocates a DIFFERENT CFID, and starts adjusting its schema
>>>>>>>
>>>>>>> A and B both have a CFID, which they will use to make a data
>>>>>>> directory on disk, and which they will push/pull to the rest of the
>>>>>>> cluster through schema propagation.
>>>>>>>
>>>>>>> The later CFID will be saved in the schema, because the schema is a
>>>>>>> normal cassandra table with last-write-wins semantics, but the first
>>>>>>> CFID might be the one that's used to create the data directory on disk,
>>>>>>> and it may have all of your data in it while you write to the table.
>>>>>>>
>>>>>>> In some cases, you'll get CFID mismatch errors on reads or writes,
>>>>>>> as the CFID in memory varies between instances.
>>>>>>> In other cases, things work fine until you restart, at which time
>>>>>>> the CFID for the table changes when you load the new schema, and data on
>>>>>>> disk isn't found.
>>>>>>>
>>>>>>> This race, unfortunately, can even occur on a single node in SOME
>>>>>>> versions of Cassandra (but not all)
>>>>>>>
>>>>>>> This is a really really really bad race in many old versions of
>>>>>>> cassandra, and a lot of the schema redesign in 4.0 is meant to solve
>>>>>>> many of these types of problems.
>>>>>>>
>>>>>>> That this continues to be possible in old versions is scary; people
>>>>>>> running old versions should not do concurrent schema changes (especially
>>>>>>> those that CREATE tables). Alternatively, you should alert if the CFID
>>>>>>> in memory doesn't match the CFID in the disk path. One could also change
>>>>>>> cassandra to use deterministic CFIDs to avoid the race entirely (though
>>>>>>> deterministic CFIDs have a different problem, which is that DROP +
>>>>>>> re-CREATE with any host down potentially allows data on that down host
>>>>>>> to come back when the host comes back online).
>>>>>>>
>>>>>>> Stronger cluster metadata starts making this much safer, so looking
>>>>>>> forward to seeing that in future releases.
>>>>>>>
>>>>>>> On Wed, Oct 13, 2021 at 10:23 AM vytenis silgalis <vsilga...@gmail.com> wrote:
>>>>>>>
>>>>>>>> You ran the `alter keyspace` command on the original dc1 nodes or
>>>>>>>> the new dc2 nodes?
>>>>>>>>
>>>>>>>> On Wed, Oct 13, 2021 at 8:15 AM Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:
>>>>>>>>
>>>>>>>>> Hi Tom,
>>>>>>>>>
>>>>>>>>> while I am not completely sure what might cause your issue, I just
>>>>>>>>> want to highlight that schema agreements were overhauled in 4.0 (1) a
>>>>>>>>> lot so that may be somehow related to what that ticket was trying to
>>>>>>>>> fix.
>>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-15158
>>>>>>>>>
>>>>>>>>> On Fri, 1 Oct 2021 at 18:43, Tom Offermann <tofferm...@newrelic.com> wrote:
>>>>>>>>> >
>>>>>>>>> > When adding a datacenter to a keyspace (following the Last
>>>>>>>>> > Pickle [Data Center Switch][lp] playbook), I ran into a
>>>>>>>>> > "Configuration exception merging remote schema" error. The nodes in
>>>>>>>>> > one datacenter didn't converge to the new schema version, and after
>>>>>>>>> > restarting them, I saw the symptoms described in this Datastax
>>>>>>>>> > article on [Fixing a table schema collision][ds], where there were
>>>>>>>>> > two data directories for each table in the keyspace on the nodes
>>>>>>>>> > that didn't converge. I followed the recovery steps in the Datastax
>>>>>>>>> > article to move the data from the older directories to the new
>>>>>>>>> > directories, ran `nodetool refresh`, and that fixed the problem.
>>>>>>>>> >
>>>>>>>>> > [lp]: https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>>>>>>>>> > [ds]: https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useCreateTableCollisionFix.html
>>>>>>>>> >
>>>>>>>>> > While the Datastax article was super helpful for helping me
>>>>>>>>> > recover, I'm left wondering *why* this happened. If anyone can shed
>>>>>>>>> > some light on that, or offer advice on how I can avoid getting in
>>>>>>>>> > this situation in the future, I would be most appreciative. I'll
>>>>>>>>> > describe the steps I took in more detail in the thread.
>>>>>>>>> >
>>>>>>>>> > ## Steps
>>>>>>>>> >
>>>>>>>>> > 1. The day before, I had added the second datacenter ('dc2') to
>>>>>>>>> >    the system_traces, system_distributed, and system_auth keyspaces
>>>>>>>>> >    and ran `nodetool rebuild` for each of the 3 keyspaces. All of
>>>>>>>>> >    that went smoothly with no issues.
>>>>>>>>> >
>>>>>>>>> > 2.
>>>>>>>>> >    For a large keyspace, I added the second datacenter ('dc2') with an
>>>>>>>>> >    `ALTER KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '3'};`
>>>>>>>>> >    statement. Immediately, I saw this error in the log:
>>>>>>>>> >    ```
>>>>>>>>> >    "ERROR 16:45:47 Exception in thread Thread[MigrationStage:1,5,main]"
>>>>>>>>> >    "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>>>>>> >    "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:594) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]"
>>>>>>>>> >    "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]"
>>>>>>>>> >    "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]"
>>>>>>>>> >    "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
>>>>>>>>> >    "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>>>>>>> >    ```
>>>>>>>>> >
>>>>>>>>> >    I also saw this:
>>>>>>>>> >    ```
>>>>>>>>> >    "ERROR 16:46:48 Configuration exception merging remote schema"
>>>>>>>>> >    "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>>>>>> >    "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:91) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_232]"
>>>>>>>>> >    "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_232]"
>>>>>>>>> >    "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]"
>>>>>>>>> >    "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
>>>>>>>>> >    "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >    "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>>>>>>> >    ```
>>>>>>>>> >    This error repeated several times over the next 2 minutes.
>>>>>>>>> >
>>>>>>>>> > 3. While running `nodetool describecluster` repeatedly, I saw
>>>>>>>>> >    that the nodes in the 'dc2' datacenter converged to the new schema
>>>>>>>>> >    version quickly, but the nodes in the original datacenter ('dc1')
>>>>>>>>> >    remained at the previous schema version.
>>>>>>>>> >
>>>>>>>>> > 4.
>>>>>>>>> >    I waited to see if all of the nodes would converge to the new
>>>>>>>>> >    schema version, but they still hadn't converged after roughly 10
>>>>>>>>> >    minutes. Given the errors I saw, I wasn't optimistic it would work
>>>>>>>>> >    out all by itself, so I decided to restart the nodes in the 'dc1'
>>>>>>>>> >    datacenter one at a time so they would restart with the latest
>>>>>>>>> >    schema version.
>>>>>>>>> >
>>>>>>>>> > 5. After each node restarted, `nodetool describecluster` showed
>>>>>>>>> >    it as being on the latest schema version. So, after getting
>>>>>>>>> >    through all the 'dc1' nodes, it looked like everything in the
>>>>>>>>> >    cluster was healthy again.
>>>>>>>>> >
>>>>>>>>> > 6. However, that's when I noticed that there were two data
>>>>>>>>> >    directories on disk for each table in the keyspace. New writes for
>>>>>>>>> >    a table were being saved in the newer directory, but queries for
>>>>>>>>> >    data saved in the old data directory were returning no results.
>>>>>>>>> >
>>>>>>>>> > 7. That's when I followed the recovery steps in the Datastax
>>>>>>>>> >    article with great success.
>>>>>>>>> >
>>>>>>>>> > ## Questions
>>>>>>>>> >
>>>>>>>>> > * My understanding is that running concurrent schema updates
>>>>>>>>> >   should always be avoided, since that can result in schema
>>>>>>>>> >   collisions. But, in this case, I wasn't performing multiple schema
>>>>>>>>> >   updates. I was just running a single `ALTER KEYSPACE` statement.
>>>>>>>>> >   Any idea why a single schema update would result in a schema
>>>>>>>>> >   collision and two data directories per table?
>>>>>>>>> >
>>>>>>>>> > * Should I have waited longer before restarting nodes? Perhaps,
>>>>>>>>> >   given enough time, the Cassandra nodes would have all converged on
>>>>>>>>> >   the correct schema version, and this would have resolved on its own?
>>>>>>>>> >
>>>>>>>>> > * Any suggestions for how I can avoid this problem in the future?
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Tom Offermann
>>>>>>>>> > Lead Software Engineer
>>>>>>>>> > http://newrelic.com

--
Tom Offermann
Lead Software Engineer
http://newrelic.com
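
Jeff notes upthread that a CFID is a UUID that includes a high-resolution timestamp. As a rough sketch, assuming the IDs are version-1 (time-based) UUIDs, the allocation time of each ID in the "Column family ID mismatch" error can be decoded with Python's standard `uuid` module. The helper name `cfid_allocated_at` is hypothetical.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version-1 UUID timestamps count 100-nanosecond intervals
# since 1582-10-15 00:00 UTC.
UUID_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def cfid_allocated_at(cfid: str) -> datetime:
    """Decode the timestamp embedded in a time-based (version 1) UUID."""
    u = uuid.UUID(cfid)
    if u.version != 1:
        raise ValueError(f"{cfid} is not a time-based (version 1) UUID")
    # u.time is in 100 ns units; integer-divide by 10 to get microseconds.
    return UUID_EPOCH + timedelta(microseconds=u.time // 10)


if __name__ == "__main__":
    # The two IDs from the error message quoted in this thread.
    for label, cfid in [
        ("found", "8ad72660-f629-11eb-a217-e1a09d8bc60c"),
        ("expected", "20739eb0-d92e-11e6-b42f-e7eb6f21c481"),
    ]:
        print(label, cfid_allocated_at(cfid).isoformat())
```

If this decoding is right, the "expected" ID dates to January 2017 and the "found" ID to August 2021, weeks before the `ALTER KEYSPACE`. That would be consistent with Jeff's guess that the IDs had diverged earlier, and it can be cross-checked against the timestamps on the data directories.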