Re: Schema collision results in multiple data directories per table

Tom Offermann Fri, 15 Oct 2021 15:27:53 -0700

Interesting!

Is there a way to determine if the on-disk schema and the in-memory schema
are in sync? Is there a way to force them to sync? If so, would it help to
force a sync before running an `ALTER KEYSPACE` schema change?


On Fri, Oct 15, 2021 at 3:08 PM Jeff Jirsa <[email protected]> wrote:

> I would not expect an ALTER KEYSPACE to introduce a divergent CFID, that
> usually happens during a CREATE TABLE. With no other evidence or ability to
> debug, I would guess that the CFIDs diverged previously, but due to the
> race(s) I described, the on-disk schema and the in-memory schema differed,
> and the ALTER KEYSPACE forces the schema from one host to be serialized and
> forced to the others, where the actual IDs get reconciled.
>
> You may be able to confirm/demonstrate that by looking at the timestamps
> on the data directories across all of the hosts in the cluster?
>
>
>
> On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann <[email protected]>
> wrote:
>
>> Jeff,
>>
>> Thanks for describing the race condition.
>>
>> I understand that performing concurrent schema changes is dangerous, and
>> that running an `ALTER KEYSPACE` on one node, and then running another
>> `ALTER KEYSPACE` on a different node, before the first has fully propagated
>> throughout the cluster, can lead to schema collisions.
>>
>> But, can running a single `ALTER KEYSPACE` on a single node also be
>> vulnerable to this race condition?
>>
>> We were careful to make sure that all nodes in both datacenters were on
>> the same schema version ID by checking the output of `nodetool
>> describecluster`. Since all nodes were in agreement, I figured that I had
>> ruled out the possibility of concurrent schema changes.
>>
>> As I mentioned, on the day before, we did run 3 different `ALTER
>> KEYSPACE` schema changes (to add 'dc2' to system_traces,
>> system_distributed, and system_auth) and also ran `nodetool rebuild` for
>> each of the 3 keyspaces. Is it possible that one or more of these schema
>> changes hadn't fully propagated 24 hours later, even though `nodetool
>> describecluster` showed all nodes as being on the same schema version? Is
>> there a better way to determine that I am not inadvertently issuing
>> concurrent schema changes?
>>
>> I'm also curious about how CFIDs are generated and when new ones are
>> generated. What I've noticed is that when I successfully run `ALTER
>> KEYSPACE` to add a datacenter with no errors (and make no other schema
>> changes), then the table IDs in `system_schema.tables` remain unchanged.
>> But, when we saw the schema collision that I described in this thread, that
>> resulted in new table IDs in `system_schema.tables`. Why do these table IDs
>> normally remain unchanged? What caused new ones to be generated in the
>> error case I described?
>>
>> --Tom
>>
>> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa <[email protected]> wrote:
>>
>>> I've described this race a few times on the list. It is very very
>>> dangerous to do concurrent table creation in cassandra with
>>> non-determistic CFIDs.
>>>
>>> I'll try to describe it quickly right now:
>>> - Imagine you have 3 hosts, A B and C
>>>
>>> You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
>>> A allocates a CFID (which is a UUID, which includes a high resolution
>>> timestamp), starts adjusting it's schema
>>> Before it can finish that schema, you connect to B and issue the same
>>> CREATE TABLE statement
>>> B allocates a DIFFERENT CFID, and starts adjusting its schema
>>>
>>> A and B both have a CFID, which they will use to make a data directory
>>> on disk, and which they will push/pull to the rest of the cluster through
>>> schema propagation.
>>>
>>> The later CFID will be saved in the schema, because the schema is a
>>> normal cassandra table with last-write-wins semantics, but the first CFID
>>> might be the one that's used to create the data directory on disk, and it
>>> may have all of your data in it while you write to the table.
>>>
>>> In some cases, you'll get CFID mismatch errors on reads or writes, as
>>> the CFID in memory varies between instances.
>>> In other cases, things work fine until you restart, at which time the
>>> CFID for the table changes when you load the new schema, and data on disk
>>> isn't found.
>>>
>>> This race, unfortunately, can even occur on a single node in SOME
>>> versions of Cassandra (but not all)
>>>
>>> This is a really really really bad race in many old versions of
>>> cassandra, and a lot of the schema redesign in 4.0 is meant to solve many
>>> of these types of problems.
>>>
>>> That this continues to be possible in old versions is scary, people
>>> running old versions should not do concurrent schema changes (especially
>>> those that CREATE tables). Alternatively, you should alert if the CFID in
>>> memory doesnt match the CFID in the disk path. One could also change
>>> cassandra to use deterministic CFIDs  to avoid the race entirely (though
>>> deterministic CFIDs have a different problem, which is that DROP +
>>> re-CREATE with any host down potentially allows data on that down host to
>>> come back when the host comes back online).
>>>
>>> Stronger cluster metadata starts making this much safer, so looking
>>> forward to seeing that in future releases.
>>>
>>>
>>>
>>> On Wed, Oct 13, 2021 at 10:23 AM vytenis silgalis <[email protected]>
>>> wrote:
>>>
>>>> You ran the `alter keyspace` command on the original dc1 nodes or the
>>>> new dc2 nodes?
>>>>
>>>> On Wed, Oct 13, 2021 at 8:15 AM Stefan Miklosovic <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Tom,
>>>>>
>>>>> while I am not completely sure what might cause your issue, I just
>>>>> want to highlight that schema agreements were overhauled in 4.0 (1) a
>>>>> lot so that may be somehow related to what that ticket was trying to
>>>>> fix.
>>>>>
>>>>> Regards
>>>>>
>>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-15158
>>>>>
>>>>> On Fri, 1 Oct 2021 at 18:43, Tom Offermann <[email protected]>
>>>>> wrote:
>>>>> >
>>>>> > When adding a datacenter to a keyspace (following the Last Pickle
>>>>> [Data Center Switch][lp] playbook), I ran into a "Configuration exception
>>>>> merging remote schema" error. The nodes in one datacenter didn't converge
>>>>> to the new schema version, and after restarting them, I saw the symptoms
>>>>> described in this Datastax article on [Fixing a table schema
>>>>> collision][ds], where there were two data directories for each table in 
>>>>> the
>>>>> keyspace on the nodes that didn't converge. I followed the recovery steps
>>>>> in the Datastax article to move the data from the older directories to the
>>>>> new directories, ran `nodetool refresh`, and that fixed the problem.
>>>>> >
>>>>> > [lp]:
>>>>> https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>>>>> > [ds]:
>>>>> https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useCreateTableCollisionFix.html
>>>>> >
>>>>> > While the Datastax article was super helpful for helping me recover,
>>>>> I'm left wondering *why* this happened. If anyone can shed some light on
>>>>> that, or offer advice on how I can avoid getting in this situation in the
>>>>> future, I would be most appreciative. I'll describe the steps I took in
>>>>> more detail in the thread.
>>>>> >
>>>>> > ## Steps
>>>>> >
>>>>> > 1. The day before, I had added the second datacenter ('dc2') to the
>>>>> system_traces, system_distributed, and system_auth keyspaces and ran
>>>>> `nodetool rebuild` for each of the 3 keyspaces. All of that went smoothly
>>>>> with no issues.
>>>>> >
>>>>> > 2. For a large keyspace, I added the second datacenter ('dc2') with
>>>>> an `ALTER KEYSPACE foo WITH replication = {'class':
>>>>> 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '3'};` statement.
>>>>> Immediately, I saw this error in the log:
>>>>> >     ```
>>>>> >     "ERROR 16:45:47 Exception in thread
>>>>> Thread[MigrationStage:1,5,main]"
>>>>> >     "org.apache.cassandra.exceptions.ConfigurationException: Column
>>>>> family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected
>>>>> 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>> >     "\tat
>>>>> org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:594)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>> ~[na:1.8.0_232]"
>>>>> >     "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>> ~[na:1.8.0_232]"
>>>>> >     "\tat
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>> ~[na:1.8.0_232]"
>>>>> >     "\tat
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>> [na:1.8.0_232]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
>>>>> [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>>> >     ```
>>>>> >
>>>>> >     I also saw this:
>>>>> >     ```
>>>>> >     "ERROR 16:46:48 Configuration exception merging remote schema"
>>>>> >     "org.apache.cassandra.exceptions.ConfigurationException: Column
>>>>> family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected
>>>>> 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>> >     "\tat
>>>>> org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:91)
>>>>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat 
>>>>> > org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53)
>>>>> [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat 
>>>>> > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
>>>>> [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat
>>>>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>>>>> [na:1.8.0_232]"
>>>>> >     "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>> [na:1.8.0_232]"
>>>>> >     "\tat
>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>>>>> [na:1.8.0_232]"
>>>>> >     "\tat
>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>>>>> [na:1.8.0_232]"
>>>>> >     "\tat
>>>>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
>>>>> [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>> >     "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>>> >     ```
>>>>> >     This error repeated several times over the next 2 minutes.
>>>>> >
>>>>> > 3. While running `nodetool describecluster` repeatedly, I saw that
>>>>> the nodes in the 'dc2' datacenter converged to the new schema version
>>>>> quickly, but the nodes in the original datacenter ('dc1') remained at the
>>>>> previous schema version.
>>>>> >
>>>>> > 4. I waited to see if all of the nodes would converge to the new
>>>>> schema version, but they still hadn't converged after roughly 10 minutes.
>>>>> Given the errors I saw, I wasn't optimistic it would work out all by
>>>>> itself, so I decided to restart the nodes in the 'dc1' datacenter one at a
>>>>> time so they would restart with the latest schema version.
>>>>> >
>>>>> > 5. After each node restarted, `nodetool describecluster` showed it
>>>>> as being on the latest schema version. So, after getting through all the
>>>>> 'dc1' nodes, it looked like everything in the cluster was healthy again.
>>>>> >
>>>>> > 6. However, that's when I noticed that there were two data
>>>>> directories on disk for each table in the keyspace. New writes for a table
>>>>> were being saved in the newer directory, but queries for data saved in the
>>>>> old data directory were returning no results.
>>>>> >
>>>>> > 7. That's when I followed the recovery steps in the Datastax article
>>>>> with great success.
>>>>> >
>>>>> > ## Questions
>>>>> >
>>>>> > * My understanding is that running concurrent schema updates should
>>>>> always be avoided, since that can result in schema collisions. But, in 
>>>>> this
>>>>> case, I wasn't performing multiple schema updates. I was just running a
>>>>> single `ALTER KEYSPACE` statement. Any idea why a single schema update
>>>>> would result in a schema collision and two data directories per table?
>>>>> >
>>>>> > * Should I have waited longer before restarting nodes? Perhaps,
>>>>> given enough time, the Cassandra nodes would have all converged on the
>>>>> correct schema version, and this would have resolved on it's own?
>>>>> >
>>>>> > * Any suggestions for how I can avoid this problem in the future?
>>>>> >
>>>>> > --
>>>>> > Tom Offermann
>>>>> > Lead Software Engineer
>>>>> > http://newrelic.com
>>>>>
>>>>
>>
>> --
>> Tom Offermann
>> Lead Software Engineer
>> http://newrelic.com
>>
>

-- 
Tom Offermann
Lead Software Engineer
http://newrelic.com

Re: Schema collision results in multiple data directories per table

Reply via email to