
Ahh...I see. That makes sense. I'll add this to the list of things to check
before making a schema change.

Thanks so much for taking the time to walk me through this. Really
appreciate all of your help!

On Fri, Oct 15, 2021 at 3:52 PM Jeff Jirsa wrote:

> Consistency doesn't matter for schema.
> For every host: "SELECT id FROM system_schema.tables WHERE
> keyspace_name=? AND table_name=?"
> Then, compare that to the /path/to/data/keyspace/table-(id)/ on disk.
> If any of those don't match, you've got a problem waiting to bite you on
> next restart.
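
A minimal sketch of the comparison described above, assuming the table id
has already been fetched from `system_schema.tables` on the host being
checked, and that the on-disk directory is named `<table>-<id with the
dashes removed>`. The function names and the `users` table in the usage
note are made up for illustration; this is not an official tool.

```python
import uuid
from pathlib import Path
from typing import List


def expected_dir_name(table_name: str, table_id: str) -> str:
    """Directory name expected on disk: <table>-<id as 32 hex chars, no dashes>."""
    return f"{table_name}-{uuid.UUID(table_id).hex}"


def find_mismatched_dirs(keyspace_dir: Path, table_name: str, table_id: str) -> List[str]:
    """Return directories for table_name whose id suffix differs from table_id."""
    expected = expected_dir_name(table_name, table_id)
    mismatched = []
    for entry in sorted(keyspace_dir.iterdir()):
        if not entry.is_dir():
            continue
        # Split "<table>-<hexid>" on the last dash so table names with dashes still work.
        name, sep, suffix = entry.name.rpartition("-")
        if sep and name == table_name and len(suffix) == 32 and entry.name != expected:
            mismatched.append(entry.name)
    return mismatched
```

For example, `find_mismatched_dirs(Path("/path/to/data/keyspace"), "users",
"<id from system_schema.tables>")` returns an empty list when the only
directory for that table carries the id from the schema. Anything it does
return is a directory that host would not load after a restart.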
> On Fri, Oct 15, 2021 at 3:48 PM Tom Offermann wrote:
>> So, if I were to do `CONSISTENCY ALL; select *` from each of the
>> system_schema tables, then on-disk and in-memory should be in sync?
>> On Fri, Oct 15, 2021 at 3:38 PM Jeff Jirsa wrote:
>>> Heap dumps + filesystem inspection + SELECT from schema tables.
>>> On Fri, Oct 15, 2021 at 3:28 PM Tom Offermann wrote:
>>>> Interesting!
>>>> Is there a way to determine if the on-disk schema and the in-memory
>>>> schema are in sync? Is there a way to force them to sync? If so, would it
>>>> help to force a sync before running an `ALTER KEYSPACE` schema change?
>>>> On Fri, Oct 15, 2021 at 3:08 PM Jeff Jirsa wrote:
>>>>> I would not expect an ALTER KEYSPACE to introduce a divergent CFID;
>>>>> that usually happens during a CREATE TABLE. With no other evidence or
>>>>> ability to debug, I would guess that the CFIDs diverged previously, but
>>>>> due to the race(s) I described, the on-disk schema and the in-memory
>>>>> schema differed, and the ALTER KEYSPACE forces the schema from one host
>>>>> to be serialized and pushed to the others, where the actual IDs get
>>>>> reconciled.
>>>>> You may be able to confirm/demonstrate that by looking at the
>>>>> timestamps on the data directories across all of the hosts in the cluster?
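
A related check, sketched under the assumption that CFIDs are time-based
(version 1) UUIDs, as the race description further down the thread says:
the creation time can be decoded from the ID itself. The helper name
`cfid_created_at` is made up; the two IDs are the ones from the "Column
family ID mismatch" error in the original post.

```python
import uuid
from datetime import datetime, timedelta, timezone

# Version-1 UUID timestamps count from the start of the Gregorian calendar.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)


def cfid_created_at(cfid: str) -> datetime:
    """Decode the timestamp embedded in a time-based (version 1) UUID."""
    u = uuid.UUID(cfid)
    if u.version != 1:
        raise ValueError(f"{cfid} is not a time-based (version 1) UUID")
    # u.time is the number of 100-nanosecond intervals since the Gregorian epoch.
    return GREGORIAN_EPOCH + timedelta(microseconds=u.time // 10)


found = cfid_created_at("8ad72660-f629-11eb-a217-e1a09d8bc60c")
expected = cfid_created_at("20739eb0-d92e-11e6-b42f-e7eb6f21c481")
print("found   :", found.isoformat())
print("expected:", expected.isoformat())
```

For these two IDs, the "found" one decodes to August 2021 and the
"expected" one to January 2017. If the assumption holds, the newer ID was
allocated well before this thread began in October 2021, which would fit
the guess that the IDs diverged earlier.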
>>>>> On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann wrote:
>>>>>> Jeff,
>>>>>> Thanks for describing the race condition.
>>>>>> I understand that performing concurrent schema changes is dangerous,
>>>>>> and that running an `ALTER KEYSPACE` on one node, and then running
>>>>>> another `ALTER KEYSPACE` on a different node before the first has fully
>>>>>> propagated throughout the cluster, can lead to schema collisions.
>>>>>> But, can running a single `ALTER KEYSPACE` on a single node also be
>>>>>> vulnerable to this race condition?
>>>>>> We were careful to make sure that all nodes in both datacenters were
>>>>>> on the same schema version ID by checking the output of `nodetool
>>>>>> describecluster`. Since all nodes were in agreement, I figured that I had
>>>>>> ruled out the possibility of concurrent schema changes.
>>>>>> As I mentioned, on the day before, we did run 3 different `ALTER
>>>>>> KEYSPACE` schema changes (to add 'dc2' to system_traces,
>>>>>> system_distributed, and system_auth) and also ran `nodetool rebuild` for
>>>>>> each of the 3 keyspaces. Is it possible that one or more of these schema
>>>>>> changes hadn't fully propagated 24 hours later, even though `nodetool
>>>>>> describecluster` showed all nodes as being on the same schema version? Is
>>>>>> there a better way to determine that I am not inadvertently issuing
>>>>>> concurrent schema changes?
>>>>>> I'm also curious about how CFIDs are generated and when new ones are
>>>>>> generated. What I've noticed is that when I successfully run `ALTER
>>>>>> KEYSPACE` to add a datacenter with no errors (and make no other schema
>>>>>> changes), then the table IDs in `system_schema.tables` remain unchanged.
>>>>>> But, when we saw the schema collision that I described in this thread,
>>>>>> that resulted in new table IDs in `system_schema.tables`. Why do these
>>>>>> table IDs normally remain unchanged? What caused new ones to be generated
>>>>>> in the error case I described?
>>>>>> --Tom
>>>>>> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa wrote:
>>>>>>> I've described this race a few times on the list. It is very, very
>>>>>>> dangerous to do concurrent table creation in Cassandra with
>>>>>>> non-deterministic CFIDs.
>>>>>>> I'll try to describe it quickly right now:
>>>>>>> - Imagine you have 3 hosts, A B and C
>>>>>>> You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
>>>>>>> A allocates a CFID (which is a UUID, which includes a high
>>>>>>> resolution timestamp), and starts adjusting its schema
>>>>>>> Before it can finish that schema, you connect to B and issue the
>>>>>>> same CREATE TABLE statement
>>>>>>> B allocates a DIFFERENT CFID, and starts adjusting its schema
>>>>>>> A and B both have a CFID, which they will use to make a data
>>>>>>> directory on disk, and which they will push/pull to the rest of the
>>>>>>> cluster through schema propagation.
>>>>>>> The later CFID will be saved in the schema, because the schema is a
>>>>>>> normal Cassandra table with last-write-wins semantics, but the first
>>>>>>> CFID might be the one that's used to create the data directory on disk,
>>>>>>> and it may have all of your data in it while you write to the table.
>>>>>>> In some cases, you'll get CFID mismatch errors on reads or writes,
>>>>>>> as the CFID in memory varies between instances.
>>>>>>> In other cases, things work fine until you restart, at which time
>>>>>>> the CFID for the table changes when you load the new schema, and data on
>>>>>>> disk isn't found.
>>>>>>> This race, unfortunately, can even occur on a single node in SOME
>>>>>>> versions of Cassandra (but not all).
>>>>>>> This is a really really really bad race in many old versions of
>>>>>>> Cassandra, and a lot of the schema redesign in 4.0 is meant to solve
>>>>>>> many of these types of problems.
>>>>>>> That this continues to be possible in old versions is scary; people
>>>>>>> running old versions should not do concurrent schema changes (especially
>>>>>>> those that CREATE tables). Alternatively, you should alert if the CFID
>>>>>>> in memory doesn't match the CFID in the disk path. One could also change
>>>>>>> Cassandra to use deterministic CFIDs to avoid the race entirely (though
>>>>>>> deterministic CFIDs have a different problem, which is that DROP +
>>>>>>> re-CREATE with any host down potentially allows data on that down host
>>>>>>> to come back when the host comes back online).
>>>>>>> Stronger cluster metadata starts making this much safer, so looking
>>>>>>> forward to seeing that in future releases.
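
The race described above can be sketched as a toy model. This is not
Cassandra code: the schema table is stood in for by a dict, the write
timestamps (1000 and 1001) are invented, and random UUIDs are used only to
get two distinct IDs (real CFIDs are time-based per the description).

```python
import uuid
from typing import Dict, Tuple

# Hypothetical stand-in for a schema row: (write timestamp, CFID).
SchemaRow = Tuple[int, uuid.UUID]


def lww_merge(schema: Dict[str, SchemaRow], table: str, row: SchemaRow) -> None:
    """Keep whichever row has the later write timestamp (last write wins)."""
    current = schema.get(table)
    if current is None or row[0] > current[0]:
        schema[table] = row


# Hosts A and B each allocate their own CFID for the same CREATE TABLE.
cfid_a, cfid_b = uuid.uuid4(), uuid.uuid4()
row_a = (1000, cfid_a)  # A's schema write, earlier timestamp
row_b = (1001, cfid_b)  # B's schema write, slightly later

# Each host creates its data directory from its own CFID right away.
data_dir = {"A": f"t-{cfid_a.hex}", "B": f"t-{cfid_b.hex}"}

# Schema propagation: host A eventually sees both rows and merges them.
schema_on_a: Dict[str, SchemaRow] = {}
for row in (row_a, row_b):
    lww_merge(schema_on_a, "t", row)

schema_cfid = schema_on_a["t"][1]
# Prints False: A's directory was built from cfid_a, but the schema kept cfid_b.
print("schema CFID matches A's directory:", data_dir["A"] == f"t-{schema_cfid.hex}")
```

In this model host A keeps writing into a directory named after its own
CFID while the merged schema points at B's, which is the mismatch that
surfaces on the next restart.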
>>>>>>> On Wed, Oct 13, 2021 at 10:23 AM vytenis silgalis wrote:
>>>>>>>> You ran the `alter keyspace` command on the original dc1 nodes or
>>>>>>>> the new dc2 nodes?
>>>>>>>> On Wed, Oct 13, 2021 at 8:15 AM Stefan Miklosovic wrote:
>>>>>>>>> Hi Tom,
>>>>>>>>> while I am not completely sure what might cause your issue, I just
>>>>>>>>> want to highlight that schema agreements were overhauled in 4.0
>>>>>>>>> (1) a lot, so that may be somehow related to what that ticket was
>>>>>>>>> trying to fix.
>>>>>>>>> Regards
>>>>>>>>> (1)
>>>>>>>>> On Fri, 1 Oct 2021 at 18:43, Tom Offermann wrote:
>>>>>>>>> >
>>>>>>>>> > When adding a datacenter to a keyspace (following the Last
>>>>>>>>> > Pickle [Data Center Switch][lp] playbook), I ran into a
>>>>>>>>> > "Configuration exception merging remote schema" error. The nodes in
>>>>>>>>> > one datacenter didn't converge to the new schema version, and after
>>>>>>>>> > restarting them, I saw the symptoms described in this Datastax
>>>>>>>>> > article on [Fixing a table schema collision][ds], where there were
>>>>>>>>> > two data directories for each table in the keyspace on the nodes
>>>>>>>>> > that didn't converge. I followed the recovery steps in the Datastax
>>>>>>>>> > article to move the data from the older directories to the new
>>>>>>>>> > directories, ran `nodetool refresh`, and that fixed the problem.
>>>>>>>>> >
>>>>>>>>> > [lp]:
>>>>>>>>> > [ds]:
>>>>>>>>> >
>>>>>>>>> > While the Datastax article was super helpful for helping me
>>>>>>>>> > recover, I'm left wondering *why* this happened. If anyone can shed
>>>>>>>>> > some light on that, or offer advice on how I can avoid getting in
>>>>>>>>> > this situation in the future, I would be most appreciative. I'll
>>>>>>>>> > describe the steps I took in more detail in the thread.
>>>>>>>>> >
>>>>>>>>> > ## Steps
>>>>>>>>> >
>>>>>>>>> > 1. The day before, I had added the second datacenter ('dc2') to
>>>>>>>>> > the system_traces, system_distributed, and system_auth keyspaces and
>>>>>>>>> > ran `nodetool rebuild` for each of the 3 keyspaces. All of that went
>>>>>>>>> > smoothly with no issues.
>>>>>>>>> >
>>>>>>>>> > 2. For a large keyspace, I added the second datacenter ('dc2')
>>>>>>>>> > with an `ALTER KEYSPACE foo WITH replication = {'class':
>>>>>>>>> > 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '3'};` statement.
>>>>>>>>> > Immediately, I saw this error in the log:
>>>>>>>>> >     ```
>>>>>>>>> >     "ERROR 16:45:47 Exception in thread
>>>>>>>>> Thread[MigrationStage:1,5,main]"
>>>>>>>>> >     "org.apache.cassandra.exceptions.ConfigurationException:
>>>>>>>>> Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c;
>>>>>>>>> expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.config.CFMetaData.validateCompatibility(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.config.CFMetaData.apply(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.config.Schema.updateTable(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.service.MigrationManager$1.runMayThrow(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> java.util.concurrent.Executors$
>>>>>>>>> ~[na:1.8.0_232]"
>>>>>>>>> >     "\tat
>>>>>>>>> ~[na:1.8.0_232]"
>>>>>>>>> >     "\tat
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(
>>>>>>>>> ~[na:1.8.0_232]"
>>>>>>>>> >     "\tat
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$
>>>>>>>>> [na:1.8.0_232]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(
>>>>>>>>> [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat ~[na:1.8.0_232]"
>>>>>>>>> >     ```
>>>>>>>>> >
>>>>>>>>> >     I also saw this:
>>>>>>>>> >     ```
>>>>>>>>> >     "ERROR 16:46:48 Configuration exception merging remote
>>>>>>>>> schema"
>>>>>>>>> >     "org.apache.cassandra.exceptions.ConfigurationException:
>>>>>>>>> Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c;
>>>>>>>>> expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.config.CFMetaData.validateCompatibility(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.config.CFMetaData.apply(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.config.Schema.updateTable(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.service.MigrationTask$1.response(
>>>>>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat 
>>>>>>>>> >
>>>>>>>>> [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat 
>>>>>>>>> >
>>>>>>>>> [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat
>>>>>>>>> java.util.concurrent.Executors$
>>>>>>>>> [na:1.8.0_232]"
>>>>>>>>> >     "\tat
>>>>>>>>> [na:1.8.0_232]"
>>>>>>>>> >     "\tat
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(
>>>>>>>>> [na:1.8.0_232]"
>>>>>>>>> >     "\tat
>>>>>>>>> java.util.concurrent.ThreadPoolExecutor$
>>>>>>>>> [na:1.8.0_232]"
>>>>>>>>> >     "\tat
>>>>>>>>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(
>>>>>>>>> [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>>> >     "\tat ~[na:1.8.0_232]"
>>>>>>>>> >     ```
>>>>>>>>> >     This error repeated several times over the next 2 minutes.
>>>>>>>>> >
>>>>>>>>> > 3. While running `nodetool describecluster` repeatedly, I saw
>>>>>>>>> > that the nodes in the 'dc2' datacenter converged to the new schema
>>>>>>>>> > version quickly, but the nodes in the original datacenter ('dc1')
>>>>>>>>> > remained at the previous schema version.
>>>>>>>>> >
>>>>>>>>> > 4. I waited to see if all of the nodes would converge to the new
>>>>>>>>> > schema version, but they still hadn't converged after roughly 10
>>>>>>>>> > minutes. Given the errors I saw, I wasn't optimistic it would work
>>>>>>>>> > out all by itself, so I decided to restart the nodes in the 'dc1'
>>>>>>>>> > datacenter one at a time so they would restart with the latest
>>>>>>>>> > schema version.
>>>>>>>>> >
>>>>>>>>> > 5. After each node restarted, `nodetool describecluster` showed
>>>>>>>>> > it as being on the latest schema version. So, after getting through
>>>>>>>>> > all the 'dc1' nodes, it looked like everything in the cluster was
>>>>>>>>> > healthy again.
>>>>>>>>> >
>>>>>>>>> > 6. However, that's when I noticed that there were two data
>>>>>>>>> > directories on disk for each table in the keyspace. New writes for a
>>>>>>>>> > table were being saved in the newer directory, but queries for data
>>>>>>>>> > saved in the old data directory were returning no results.
>>>>>>>>> >
>>>>>>>>> > 7. That's when I followed the recovery steps in the Datastax
>>>>>>>>> > article with great success.
>>>>>>>>> >
>>>>>>>>> > ## Questions
>>>>>>>>> >
>>>>>>>>> > * My understanding is that running concurrent schema updates
>>>>>>>>> > should always be avoided, since that can result in schema
>>>>>>>>> > collisions. But, in this case, I wasn't performing multiple schema
>>>>>>>>> > updates. I was just running a single `ALTER KEYSPACE` statement. Any
>>>>>>>>> > idea why a single schema update would result in a schema collision
>>>>>>>>> > and two data directories per table?
>>>>>>>>> >
>>>>>>>>> > * Should I have waited longer before restarting nodes? Perhaps,
>>>>>>>>> > given enough time, the Cassandra nodes would have all converged on
>>>>>>>>> > the correct schema version, and this would have resolved on its own?
>>>>>>>>> >
>>>>>>>>> > * Any suggestions for how I can avoid this problem in the future?
>>>>>>>>> >
>>>>>>>>> > --
>>>>>>>>> > Tom Offermann
>>>>>>>>> > Lead Software Engineer
>>>>>>>>> >

Tom Offermann
Lead Software Engineer
