Consistency doesn't matter for schema. For every host: "select id from system_schema.tables WHERE keyspace_name=? and table_name=?" ( https://github.com/apache/cassandra/blob/cassandra-3.11/src/java/org/apache/cassandra/schema/SchemaKeyspace.java#L144 )

Then, compare that to the /path/to/data/keyspace/table-(id)/ directory on disk. If any of those don't match, you've got a problem waiting to bite you on the next restart.
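To make that two-step check concrete, here is a minimal sketch meant to be run locally on each node. It assumes the Python cassandra-driver package, a node listening on 127.0.0.1, and the default /var/lib/cassandra/data layout; the keyspace, table, and data path are placeholders to adjust for your cluster:

```
# Minimal sketch: compare the table id in system_schema.tables with the
# table's data directory name on disk. Run locally on each node.
# Assumes: Python cassandra-driver, node on 127.0.0.1, default data path.
import os
import sys

from cassandra.cluster import Cluster

DATA_DIR = "/var/lib/cassandra/data"  # adjust to your data_file_directories


def check(keyspace, table):
    cluster = Cluster(["127.0.0.1"])
    session = cluster.connect()
    row = session.execute(
        "SELECT id FROM system_schema.tables "
        "WHERE keyspace_name = %s AND table_name = %s",
        (keyspace, table),
    ).one()
    cluster.shutdown()
    if row is None:
        print(f"{keyspace}.{table}: not found in system_schema.tables")
        return
    schema_id = row.id.hex  # directory suffix is the table id without dashes
    ks_dir = os.path.join(DATA_DIR, keyspace)
    table_dirs = [d for d in os.listdir(ks_dir) if d.startswith(table + "-")]
    if not table_dirs:
        print(f"{keyspace}.{table}: no data directory under {ks_dir}")
    for d in table_dirs:
        disk_id = d.split("-", 1)[1]
        status = "OK" if disk_id == schema_id else "MISMATCH"
        print(f"{keyspace}.{table}: schema id {schema_id}, on disk {d} -> {status}")


if __name__ == "__main__":
    check(sys.argv[1], sys.argv[2])
```

Seeing more than one directory for a table, or a directory whose suffix doesn't match the id in system_schema.tables, is exactly the "problem waiting to bite you" described above.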
On Fri, Oct 15, 2021 at 3:48 PM Tom Offermann <tofferm...@newrelic.com> wrote:

> So, if I were to do `CONSISTENCY ALL; select *` from each of the system_schema tables, then on-disk and in-memory should be in sync?
>
> On Fri, Oct 15, 2021 at 3:38 PM Jeff Jirsa <jji...@gmail.com> wrote:
>
>> Heap dumps + filesystem inspection + SELECT from schema tables.
>>
>> On Fri, Oct 15, 2021 at 3:28 PM Tom Offermann <tofferm...@newrelic.com> wrote:
>>
>>> Interesting!
>>>
>>> Is there a way to determine if the on-disk schema and the in-memory schema are in sync? Is there a way to force them to sync? If so, would it help to force a sync before running an `ALTER KEYSPACE` schema change?
>>>
>>> On Fri, Oct 15, 2021 at 3:08 PM Jeff Jirsa <jji...@gmail.com> wrote:
>>>
>>>> I would not expect an ALTER KEYSPACE to introduce a divergent CFID; that usually happens during a CREATE TABLE. With no other evidence or ability to debug, I would guess that the CFIDs diverged previously, but due to the race(s) I described, the on-disk schema and the in-memory schema differed, and the ALTER KEYSPACE forces the schema from one host to be serialized and pushed to the others, where the actual IDs get reconciled.
>>>>
>>>> You may be able to confirm/demonstrate that by looking at the timestamps on the data directories across all of the hosts in the cluster?
>>>>
>>>> On Fri, Oct 15, 2021 at 3:02 PM Tom Offermann <tofferm...@newrelic.com> wrote:
>>>>
>>>>> Jeff,
>>>>>
>>>>> Thanks for describing the race condition.
>>>>>
>>>>> I understand that performing concurrent schema changes is dangerous, and that running an `ALTER KEYSPACE` on one node, and then running another `ALTER KEYSPACE` on a different node, before the first has fully propagated throughout the cluster, can lead to schema collisions.
>>>>>
>>>>> But, can running a single `ALTER KEYSPACE` on a single node also be vulnerable to this race condition?
>>>>>
>>>>> We were careful to make sure that all nodes in both datacenters were on the same schema version ID by checking the output of `nodetool describecluster`. Since all nodes were in agreement, I figured that I had ruled out the possibility of concurrent schema changes.
>>>>>
>>>>> As I mentioned, on the day before, we did run 3 different `ALTER KEYSPACE` schema changes (to add 'dc2' to system_traces, system_distributed, and system_auth) and also ran `nodetool rebuild` for each of the 3 keyspaces. Is it possible that one or more of these schema changes hadn't fully propagated 24 hours later, even though `nodetool describecluster` showed all nodes as being on the same schema version? Is there a better way to determine that I am not inadvertently issuing concurrent schema changes?
>>>>>
>>>>> I'm also curious about how CFIDs are generated and when new ones are generated. What I've noticed is that when I successfully run `ALTER KEYSPACE` to add a datacenter with no errors (and make no other schema changes), the table IDs in `system_schema.tables` remain unchanged. But the schema collision that I described in this thread resulted in new table IDs in `system_schema.tables`. Why do these table IDs normally remain unchanged? What caused new ones to be generated in the error case I described?
>>>>>
>>>>> --Tom
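On the question above of how to be sure the cluster has settled on one schema version before issuing the next change: besides `nodetool describecluster`, the same information can be read from a client by comparing `schema_version` in `system.local` and `system.peers` (this reflects the gossip view of whichever node you connect to). A minimal sketch, assuming the Python cassandra-driver and a reachable contact point (the address is a placeholder):

```
# Minimal sketch: report schema versions as seen by the node you connect to,
# similar in spirit to `nodetool describecluster`. Assumes cassandra-driver.
from collections import defaultdict

from cassandra.cluster import Cluster


def schema_versions(contact_points):
    cluster = Cluster(contact_points)
    session = cluster.connect()
    versions = defaultdict(set)
    local = session.execute("SELECT schema_version FROM system.local").one()
    versions[local.schema_version].add("(local)")
    for peer in session.execute("SELECT peer, schema_version FROM system.peers"):
        versions[peer.schema_version].add(str(peer.peer))
    cluster.shutdown()
    return versions


if __name__ == "__main__":
    versions = schema_versions(["10.0.0.1"])  # placeholder contact point
    for version, nodes in versions.items():
        print(version, sorted(nodes))
    if len(versions) > 1:
        print("WARNING: schema disagreement -- hold off on further DDL")
```

As the rest of the thread shows, though, agreement on a version number doesn't guarantee that the in-memory schema matches what's on disk on every node, which is why the per-node directory check sketched earlier is still worth running.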
>>>>> On Wed, Oct 13, 2021 at 10:35 AM Jeff Jirsa <jji...@gmail.com> wrote:
>>>>>
>>>>>> I've described this race a few times on the list. It is very very dangerous to do concurrent table creation in cassandra with non-deterministic CFIDs.
>>>>>>
>>>>>> I'll try to describe it quickly right now:
>>>>>>
>>>>>> - Imagine you have 3 hosts, A, B and C.
>>>>>>
>>>>>> You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
>>>>>> A allocates a CFID (which is a UUID, which includes a high-resolution timestamp) and starts adjusting its schema.
>>>>>> Before it can finish that schema, you connect to B and issue the same CREATE TABLE statement.
>>>>>> B allocates a DIFFERENT CFID, and starts adjusting its schema.
>>>>>>
>>>>>> A and B both have a CFID, which they will use to make a data directory on disk, and which they will push/pull to the rest of the cluster through schema propagation.
>>>>>>
>>>>>> The later CFID will be saved in the schema, because the schema is a normal cassandra table with last-write-wins semantics, but the first CFID might be the one that's used to create the data directory on disk, and it may have all of your data in it while you write to the table.
>>>>>>
>>>>>> In some cases, you'll get CFID mismatch errors on reads or writes, as the CFID in memory varies between instances. In other cases, things work fine until you restart, at which time the CFID for the table changes when you load the new schema, and data on disk isn't found.
>>>>>>
>>>>>> This race, unfortunately, can even occur on a single node in SOME versions of Cassandra (but not all).
>>>>>>
>>>>>> This is a really really really bad race in many old versions of cassandra, and a lot of the schema redesign in 4.0 is meant to solve many of these types of problems.
>>>>>>
>>>>>> That this continues to be possible in old versions is scary; people running old versions should not do concurrent schema changes (especially those that CREATE tables). Alternatively, you should alert if the CFID in memory doesn't match the CFID in the disk path. One could also change cassandra to use deterministic CFIDs to avoid the race entirely (though deterministic CFIDs have a different problem, which is that DROP + re-CREATE with any host down potentially allows data on that down host to come back when the host comes back online).
>>>>>>
>>>>>> Stronger cluster metadata starts making this much safer, so looking forward to seeing that in future releases.
>>>>>>
>>>>>> On Wed, Oct 13, 2021 at 10:23 AM vytenis silgalis <vsilga...@gmail.com> wrote:
>>>>>>
>>>>>>> You ran the `alter keyspace` command on the original dc1 nodes or the new dc2 nodes?
>>>>>>>
>>>>>>> On Wed, Oct 13, 2021 at 8:15 AM Stefan Miklosovic <stefan.mikloso...@instaclustr.com> wrote:
>>>>>>>
>>>>>>>> Hi Tom,
>>>>>>>>
>>>>>>>> While I am not completely sure what might cause your issue, I just want to highlight that schema agreements were overhauled a lot in 4.0 (1), so that may be somehow related to what that ticket was trying to fix.
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> (1) https://issues.apache.org/jira/browse/CASSANDRA-15158
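To make the CFID race Jeff describes above concrete, here is a toy illustration in plain Python (not Cassandra code): each coordinator allocates its own time-based UUID for the same CREATE TABLE, last-write-wins keeps the later one in the schema tables, and a data directory created with the earlier one no longer matches.

```
# Toy illustration of the CFID race described earlier in the thread.
# Not Cassandra code -- it only shows why two coordinators handling the
# "same" CREATE TABLE ... IF NOT EXISTS end up with different time-based ids.
import uuid

table = "foo"

cfid_on_a = uuid.uuid1()  # node A allocates a CFID and starts applying it
cfid_on_b = uuid.uuid1()  # node B handles the "same" statement moments later

# The schema tables are ordinary last-write-wins tables: the later id wins.
schema_cfid = max((cfid_on_a, cfid_on_b), key=lambda u: u.time)

# But node A may already have created its data directory with its own id.
data_dir = f"{table}-{cfid_on_a.hex}"

print(f"schema says {schema_cfid}, data directory is {data_dir}")
print("MISMATCH" if schema_cfid != cfid_on_a else "happened to match")
```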
>>>>>>>> On Fri, 1 Oct 2021 at 18:43, Tom Offermann <tofferm...@newrelic.com> wrote:
>>>>>>>> >
>>>>>>>> > When adding a datacenter to a keyspace (following the Last Pickle [Data Center Switch][lp] playbook), I ran into a "Configuration exception merging remote schema" error. The nodes in one datacenter didn't converge to the new schema version, and after restarting them, I saw the symptoms described in this Datastax article on [Fixing a table schema collision][ds], where there were two data directories for each table in the keyspace on the nodes that didn't converge. I followed the recovery steps in the Datastax article to move the data from the older directories to the new directories, ran `nodetool refresh`, and that fixed the problem.
>>>>>>>> >
>>>>>>>> > [lp]: https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>>>>>>>> > [ds]: https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useCreateTableCollisionFix.html
>>>>>>>> >
>>>>>>>> > While the Datastax article was super helpful for the recovery, I'm left wondering *why* this happened. If anyone can shed some light on that, or offer advice on how I can avoid getting in this situation in the future, I would be most appreciative. I'll describe the steps I took in more detail in the thread.
>>>>>>>> >
>>>>>>>> > ## Steps
>>>>>>>> >
>>>>>>>> > 1. The day before, I had added the second datacenter ('dc2') to the system_traces, system_distributed, and system_auth keyspaces and ran `nodetool rebuild` for each of the 3 keyspaces. All of that went smoothly with no issues.
>>>>>>>> >
>>>>>>>> > 2. For a large keyspace, I added the second datacenter ('dc2') with an `ALTER KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': '2', 'dc2': '3'};` statement.
>>>>>>>> > Immediately, I saw this error in the log:
>>>>>>>> >
>>>>>>>> > ```
>>>>>>>> > "ERROR 16:45:47 Exception in thread Thread[MigrationStage:1,5,main]"
>>>>>>>> > "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>>>>> > "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:594) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) ~[na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) ~[na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
>>>>>>>> > "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>>>>>> > ```
>>>>>>>> >
>>>>>>>> > I also saw this:
>>>>>>>> >
>>>>>>>> > ```
>>>>>>>> > "ERROR 16:46:48 Configuration exception merging remote schema"
>>>>>>>> > "org.apache.cassandra.exceptions.ConfigurationException: Column family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>>>>>>>> > "\tat org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.config.Schema.updateTable(Schema.java:687) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:91) ~[apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) [na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266) [na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [na:1.8.0_232]"
>>>>>>>> > "\tat java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [na:1.8.0_232]"
>>>>>>>> > "\tat org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84) [apache-cassandra-3.11.5.jar:3.11.5]"
>>>>>>>> > "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>>>>>>>> > ```
>>>>>>>> >
>>>>>>>> > This error repeated several times over the next 2 minutes.
>>>>>>>> >
>>>>>>>> > 3. While running `nodetool describecluster` repeatedly, I saw that the nodes in the 'dc2' datacenter converged to the new schema version quickly, but the nodes in the original datacenter ('dc1') remained at the previous schema version.
>>>>>>>> >
>>>>>>>> > 4. I waited to see if all of the nodes would converge to the new schema version, but they still hadn't converged after roughly 10 minutes. Given the errors I saw, I wasn't optimistic it would work out all by itself, so I decided to restart the nodes in the 'dc1' datacenter one at a time so they would restart with the latest schema version.
>>>>>>>> >
>>>>>>>> > 5. After each node restarted, `nodetool describecluster` showed it as being on the latest schema version. So, after getting through all the 'dc1' nodes, it looked like everything in the cluster was healthy again.
>>>>>>>> >
>>>>>>>> > 6. However, that's when I noticed that there were two data directories on disk for each table in the keyspace. New writes for a table were being saved in the newer directory, but queries for data saved in the old data directory were returning no results.
>>>>>>>> >
>>>>>>>> > 7. That's when I followed the recovery steps in the Datastax article with great success.
>>>>>>>> >
>>>>>>>> > ## Questions
>>>>>>>> >
>>>>>>>> > * My understanding is that running concurrent schema updates should always be avoided, since that can result in schema collisions. But, in this case, I wasn't performing multiple schema updates. I was just running a single `ALTER KEYSPACE` statement. Any idea why a single schema update would result in a schema collision and two data directories per table?
>>>>>>>> >
>>>>>>>> > * Should I have waited longer before restarting nodes? Perhaps, given enough time, the Cassandra nodes would have all converged on the correct schema version, and this would have resolved on its own?
>>>>>>>> >
>>>>>>>> > * Any suggestions for how I can avoid this problem in the future?
>>>>>>>> >
>>>>>>>> > --
>>>>>>>> > Tom Offermann
>>>>>>>> > Lead Software Engineer
>>>>>>>> > http://newrelic.com
>>>>>
>>>>> --
>>>>> Tom Offermann
>>>>> Lead Software Engineer
>>>>> http://newrelic.com
>>>
>>> --
>>> Tom Offermann
>>> Lead Software Engineer
>>> http://newrelic.com
>
> --
> Tom Offermann
> Lead Software Engineer
> http://newrelic.com
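For completeness, here is a rough, print-only sketch of the recovery flow described in the thread: find the table directory matching the current id in system_schema.tables, plan to move the SSTables from any older, orphaned directory into it, then run `nodetool refresh`. It only prints what it would do; the DataStax article linked above remains the authoritative procedure, and the data path and id argument are placeholders.

```
# Print-only sketch of the recovery flow described in the thread.
# Follow the DataStax article for the real procedure; take a snapshot first.
import os
import sys

DATA_DIR = "/var/lib/cassandra/data"  # adjust to your data_file_directories


def plan_recovery(keyspace, table, current_id_hex):
    ks_dir = os.path.join(DATA_DIR, keyspace)
    current = f"{table}-{current_id_hex}"  # directory matching the schema id
    for d in sorted(os.listdir(ks_dir)):
        if not d.startswith(table + "-") or d == current:
            continue
        old_dir = os.path.join(ks_dir, d)
        for name in sorted(os.listdir(old_dir)):
            if name in ("backups", "snapshots"):
                continue
            print(f"would move {os.path.join(old_dir, name)} "
                  f"-> {os.path.join(ks_dir, current)}")
    print(f"then run: nodetool refresh {keyspace} {table}")


if __name__ == "__main__":
    # current_id_hex is the id from system_schema.tables, without dashes
    plan_recovery(sys.argv[1], sys.argv[2], sys.argv[3])
```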