I've described this race a few times on the list. It is very dangerous
to do concurrent table creation in Cassandra with non-deterministic CFIDs.

I'll try to describe it quickly right now (a minimal driver-level sketch
follows the steps):
- Imagine you have 3 hosts: A, B, and C.
- You connect to A and issue a "CREATE TABLE ... IF NOT EXISTS".
- A allocates a CFID (a UUID, which embeds a high-resolution timestamp) and
starts adjusting its schema.
- Before A can finish that schema change, you connect to B and issue the
same CREATE TABLE statement.
- B allocates a DIFFERENT CFID and starts adjusting its schema.

A and B each have a CFID, which they will use to create a data directory on
disk, and which they will push/pull to the rest of the cluster through
schema propagation.
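Here's a minimal sketch of the trigger, using the DataStax Java driver 4.x.
The host addresses, keyspace, and table are placeholders, and a faithful
reproduction would pin each session to its coordinator with a node-filtering
load-balancing policy; the point is just that the same IF NOT EXISTS DDL
reaches two coordinators before either has propagated its schema change:

```
import com.datastax.oss.driver.api.core.CqlSession;
import java.net.InetSocketAddress;
import java.util.concurrent.CompletableFuture;

public class CfidRace {
    private static final String DDL =
        "CREATE TABLE IF NOT EXISTS ks.t (pk int PRIMARY KEY, v text)";

    public static void main(String[] args) {
        try (CqlSession a = connect("10.0.0.1");   // host A (placeholder)
             CqlSession b = connect("10.0.0.2")) { // host B (placeholder)
            // Fire the same DDL at two coordinators concurrently. Each
            // coordinator that does not yet see ks.t in its local schema
            // allocates its own time-based CFID: two IDs for one table.
            CompletableFuture.allOf(
                a.executeAsync(DDL).toCompletableFuture(),
                b.executeAsync(DDL).toCompletableFuture()
            ).join();
        }
    }

    private static CqlSession connect(String host) {
        return CqlSession.builder()
            .addContactPoint(new InetSocketAddress(host, 9042))
            .withLocalDatacenter("dc1")
            .build();
    }
}
```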

The later CFID will be saved in the schema, because the schema is a normal
Cassandra table with last-write-wins semantics, but the first CFID might be
the one used to create the data directory on disk, and that directory may
hold all of your data while you write to the table.

In some cases, you'll get CFID mismatch errors on reads or writes, because
the CFID in memory varies between instances.
In other cases, things work fine until you restart, at which point the
table's CFID changes when the new schema is loaded, and the data on disk
isn't found.
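To see why a changed CFID strands data, note how the on-disk directory name
is derived from it in the 3.x layout (the table name here is a placeholder;
the UUID is borrowed from the error later in this thread):

```
import java.util.UUID;

public class DirName {
    public static void main(String[] args) {
        // 3.x data directories are named "<table>-<CFID hex, dashes
        // stripped>", so once the schema settles on a different CFID,
        // reads look in a directory the old SSTables were never written to.
        UUID cfid = UUID.fromString("20739eb0-d92e-11e6-b42f-e7eb6f21c481");
        System.out.println("mytable-" + cfid.toString().replace("-", ""));
        // prints: mytable-20739eb0d92e11e6b42fe7eb6f21c481
    }
}
```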

This race, unfortunately, can even occur on a single node in SOME versions
of Cassandra (but not all).

This is a really, really bad race in many old versions of Cassandra, and a
lot of the schema redesign in 4.0 is meant to solve these types of problems.

That this continues to be possible in old versions is scary. People running
old versions should not do concurrent schema changes (especially those that
CREATE tables). Alternatively, you should alert if the CFID in memory
doesn't match the CFID in the disk path (a sketch of that check follows
below). One could also change Cassandra to use deterministic CFIDs to avoid
the race entirely (though deterministic CFIDs have a different problem: a
DROP + re-CREATE with any host down potentially allows data on that down
host to come back when the host rejoins).
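A rough sketch of that alert, assuming a 3.x-style layout: compare each
table's id in system_schema.tables against the CFID suffix of its data
directories. The data path and keyspace name are placeholders:

```
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.Row;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.UUID;
import java.util.stream.Stream;

public class CfidAuditor {
    public static void main(String[] args) throws Exception {
        Path dataDir = Paths.get("/var/lib/cassandra/data/ks"); // placeholder
        try (CqlSession session = CqlSession.builder().build()) {
            for (Row row : session.execute(
                    "SELECT table_name, id FROM system_schema.tables "
                    + "WHERE keyspace_name = 'ks'")) {
                String table = row.getString("table_name");
                UUID id = row.getUuid("id");
                String expected = table + "-" + id.toString().replace("-", "");
                try (Stream<Path> dirs = Files.list(dataDir)) {
                    // Any directory for this table whose CFID suffix differs
                    // from the schema's id is a candidate for stranded data.
                    boolean mismatch = dirs
                        .map(p -> p.getFileName().toString())
                        .anyMatch(n -> n.startsWith(table + "-")
                                       && !n.equals(expected));
                    if (mismatch) {
                        System.err.println("CFID mismatch for " + table
                            + ": schema expects directory " + expected);
                    }
                }
            }
        }
    }
}
```

(CQL also lets you pin a table ID explicitly at creation time with CREATE
TABLE ... WITH ID = <uuid>, which is one way to get deterministic CFIDs,
subject to the DROP + re-CREATE caveat above.)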

Stronger cluster metadata starts making this much safer, so I'm looking
forward to seeing that in future releases.



On Wed, Oct 13, 2021 at 10:23 AM vytenis silgalis <vsilga...@gmail.com>
wrote:

> You ran the `alter keyspace` command on the original dc1 nodes or the new
> dc2 nodes?
>
> On Wed, Oct 13, 2021 at 8:15 AM Stefan Miklosovic <
> stefan.mikloso...@instaclustr.com> wrote:
>
>> Hi Tom,
>>
>> while I am not completely sure what might cause your issue, I just
>> want to highlight that schema agreements were overhauled in 4.0 (1) a
>> lot so that may be somehow related to what that ticket was trying to
>> fix.
>>
>> Regards
>>
>> (1) https://issues.apache.org/jira/browse/CASSANDRA-15158
>>
>> On Fri, 1 Oct 2021 at 18:43, Tom Offermann <tofferm...@newrelic.com>
>> wrote:
>> >
>> > When adding a datacenter to a keyspace (following the Last Pickle [Data
>> Center Switch][lp] playbook), I ran into a "Configuration exception merging
>> remote schema" error. The nodes in one datacenter didn't converge to the
>> new schema version, and after restarting them, I saw the symptoms described
>> in this Datastax article on [Fixing a table schema collision][ds], where
>> there were two data directories for each table in the keyspace on the nodes
>> that didn't converge. I followed the recovery steps in the Datastax article
>> to move the data from the older directories to the new directories, ran
>> `nodetool refresh`, and that fixed the problem.
>> >
>> > [lp]: https://thelastpickle.com/blog/2019/02/26/data-center-switch.html
>> > [ds]:
>> https://docs.datastax.com/en/dse/6.0/cql/cql/cql_using/useCreateTableCollisionFix.html
>> >
>> > While the Datastax article was super helpful for recovery,
>> I'm left wondering *why* this happened. If anyone can shed some light on
>> that, or offer advice on how I can avoid getting in this situation in the
>> future, I would be most appreciative. I'll describe the steps I took in
>> more detail in the thread.
>> >
>> > ## Steps
>> >
>> > 1. The day before, I had added the second datacenter ('dc2') to the
>> system_traces, system_distributed, and system_auth keyspaces and ran
>> `nodetool rebuild` for each of the 3 keyspaces. All of that went smoothly
>> with no issues.
>> >
>> > 2. For a large keyspace, I added the second datacenter ('dc2') with an
>> `ALTER KEYSPACE foo WITH replication = {'class': 'NetworkTopologyStrategy',
>> 'dc1': '2', 'dc2': '3'};` statement. Immediately, I saw this error in the
>> log:
>> >     ```
>> >     "ERROR 16:45:47 Exception in thread Thread[MigrationStage:1,5,main]"
>> >     "org.apache.cassandra.exceptions.ConfigurationException: Column
>> family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected
>> 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>> >     "\tat
>> org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.service.MigrationManager$1.runMayThrow(MigrationManager.java:594)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:28)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> ~[na:1.8.0_232]"
>> >     "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> ~[na:1.8.0_232]"
>> >     "\tat
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> ~[na:1.8.0_232]"
>> >     "\tat
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> [na:1.8.0_232]"
>> >     "\tat
>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
>> [apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>> >     ```
>> >
>> >     I also saw this:
>> >     ```
>> >     "ERROR 16:46:48 Configuration exception merging remote schema"
>> >     "org.apache.cassandra.exceptions.ConfigurationException: Column
>> family ID mismatch (found 8ad72660-f629-11eb-a217-e1a09d8bc60c; expected
>> 20739eb0-d92e-11e6-b42f-e7eb6f21c481)"
>> >     "\tat
>> org.apache.cassandra.config.CFMetaData.validateCompatibility(CFMetaData.java:949)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.config.CFMetaData.apply(CFMetaData.java:903)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.config.Schema.updateTable(Schema.java:687)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.schema.SchemaKeyspace.updateKeyspace(SchemaKeyspace.java:1482)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1438)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchema(SchemaKeyspace.java:1407)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.schema.SchemaKeyspace.mergeSchemaAndAnnounceVersion(SchemaKeyspace.java:1384)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> org.apache.cassandra.service.MigrationTask$1.response(MigrationTask.java:91)
>> ~[apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat 
>> > org.apache.cassandra.net.ResponseVerbHandler.doVerb(ResponseVerbHandler.java:53)
>> [apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat 
>> > org.apache.cassandra.net.MessageDeliveryTask.run(MessageDeliveryTask.java:66)
>> [apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat
>> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>> [na:1.8.0_232]"
>> >     "\tat java.util.concurrent.FutureTask.run(FutureTask.java:266)
>> [na:1.8.0_232]"
>> >     "\tat
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>> [na:1.8.0_232]"
>> >     "\tat
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>> [na:1.8.0_232]"
>> >     "\tat
>> org.apache.cassandra.concurrent.NamedThreadFactory.lambda$threadLocalDeallocator$0(NamedThreadFactory.java:84)
>> [apache-cassandra-3.11.5.jar:3.11.5]"
>> >     "\tat java.lang.Thread.run(Thread.java:748) ~[na:1.8.0_232]"
>> >     ```
>> >     This error repeated several times over the next 2 minutes.
>> >
>> > 3. While running `nodetool describecluster` repeatedly, I saw that the
>> nodes in the 'dc2' datacenter converged to the new schema version quickly,
>> but the nodes in the original datacenter ('dc1') remained at the previous
>> schema version.
>> >
>> > 4. I waited to see if all of the nodes would converge to the new schema
>> version, but they still hadn't converged after roughly 10 minutes. Given
>> the errors I saw, I wasn't optimistic it would work out all by itself, so I
>> decided to restart the nodes in the 'dc1' datacenter one at a time so they
>> would restart with the latest schema version.
>> >
>> > 5. After each node restarted, `nodetool describecluster` showed it as
>> being on the latest schema version. So, after getting through all the 'dc1'
>> nodes, it looked like everything in the cluster was healthy again.
>> >
>> > 6. However, that's when I noticed that there were two data directories
>> on disk for each table in the keyspace. New writes for a table were being
>> saved in the newer directory, but queries for data saved in the old data
>> directory were returning no results.
>> >
>> > 7. That's when I followed the recovery steps in the Datastax article
>> with great success.
>> >
>> > ## Questions
>> >
>> > * My understanding is that running concurrent schema updates should
>> always be avoided, since that can result in schema collisions. But, in this
>> case, I wasn't performing multiple schema updates. I was just running a
>> single `ALTER KEYSPACE` statement. Any idea why a single schema update
>> would result in a schema collision and two data directories per table?
>> >
>> > * Should I have waited longer before restarting nodes? Perhaps, given
>> enough time, the Cassandra nodes would have all converged on the correct
>> schema version, and this would have resolved on its own?
>> >
>> > * Any suggestions for how I can avoid this problem in the future?
>> >
>> > --
>> > Tom Offermann
>> > Lead Software Engineer
>> > http://newrelic.com
>>
>
