>
> Any possibility that you "merged" two clusters together?


Ooohh...I think that's the missing piece of this puzzle! A couple weeks
earlier, prior to the problem described in this thread, we did
inadvertently merge two clusters together. We merged the original 'dc1'
cluster with an entirely different 'dc2' cluster.

When adding a new datacenter, it is important that the new datacenter be in
its default state, with no schema loaded. But this time we had used our
normal Cassandra cluster-building automation, which also loads a schema
and creates user credentials. (No surprise that we ran this automation on
Aug 5th.) That set up a situation where the two datacenters had schemas that
differed in one crucial way: each defined a NetworkTopologyStrategy naming
only its own datacenter. Nodes in one datacenter had a schema that used
'dc1', while the other datacenter's nodes used 'dc2'.

Then, when we joined the clusters and they began gossiping, this set up the
problematic situation Jeff described, where conflicting schema changes are
happening concurrently.

Oops.

We unwound this situation by reverting to a single-datacenter
configuration: we deleted the 'dc2' nodes and altered the schema so
that the original cluster used NetworkTopologyStrategy with 'dc1' only.

I didn't think to mention this background in my original post, because I
thought we had returned the cluster back to its original, error-free
working state before we joined a new 'dc2' cluster and ran the `ALTER
KEYSPACE` statement I described. But, I now suspect this earlier
inadvertent cluster merging was the real cause of the problem we saw. If
the concurrent schema changes during the cluster merging caused the on-disk
and in-memory schemas to diverge, then running the later `ALTER KEYSPACE`
statement didn't cause the errors we saw. Instead, it just triggered the
error that was already latent.

While I feel a little sheepish about goofing this in the first place,
knowing that this is most likely the cause of the problem we saw does give
me more confidence moving forward. I was worried that any single, isolated
schema change could cause a potential race condition and a table schema
collision. But, that no longer seems to be the case. In the future, we
should be fine, so long as we: 1) Avoid concurrent schema changes, and 2)
Avoid loading a schema when building a cluster that we are going to add as
a datacenter to an existing cluster.

(I should add that all of these events have taken place in a non-production
environment. No customers have been impacted by any of these shenanigans!)

Erick, one last question: Is there a quick and easy way to extract the date
from a time UUID? I ended up inserting it into a table and then querying it
with `dateOf()`:

```
cassandra@cqlsh> CREATE TABLE ts.timestamps (
   ...    id int,
   ...    ts timeuuid,
   ...    PRIMARY KEY (id)
   ... );

cassandra@cqlsh> INSERT INTO ts.timestamps (id, ts) VALUES (1, 8ad72660-f629-11eb-a217-e1a09d8bc60c);

cassandra@cqlsh> select dateOf(ts) from ts.timestamps where id = 1;

 system.dateof(ts)
---------------------------------
 2021-08-05 20:13:04.838000+0000
```

Is there a better/faster way to do this?
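(In case it helps anyone else reading along: since a timeuuid is just a version-1 UUID, the timestamp can also be decoded client-side with Python's standard `uuid` module, with no table or round trip to Cassandra needed. A minimal sketch, assuming the UUID is version 1:)

```python
import uuid
from datetime import datetime, timedelta, timezone

# A v1 UUID's `time` field is a 60-bit count of 100-nanosecond
# intervals since the Gregorian calendar epoch, 1582-10-15 00:00 UTC.
GREGORIAN_EPOCH = datetime(1582, 10, 15, tzinfo=timezone.utc)

def timeuuid_to_datetime(s: str) -> datetime:
    t = uuid.UUID(s).time  # 100-ns intervals since 1582-10-15
    return GREGORIAN_EPOCH + timedelta(microseconds=t // 10)

dt = timeuuid_to_datetime("8ad72660-f629-11eb-a217-e1a09d8bc60c")
print(dt)  # same instant dateOf() reported: 2021-08-05 20:13:04 UTC
```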

Once again, Jeff and Erick, thanks for all of your help!

--Tom


On Fri, Oct 15, 2021 at 4:05 PM Erick Ramirez <erick.rami...@datastax.com>
wrote:

> I agree with Jeff that this isn't related to ALTER TABLE. FWIW, the
> original table was created in 2017 but a new version got created on August
> 5:
>
>    - 20739eb0-d92e-11e6-b42f-e7eb6f21c481 - Friday, January 13, 2017 at
>    1:18:01 GMT
>    - 8ad72660-f629-11eb-a217-e1a09d8bc60c - Thursday, August 5, 2021 at
>    20:13:04 GMT
>
> Would that have been when you added the new nodes? Any possibility that
> you "merged" two clusters together?
>

-- 
Tom Offermann
Lead Software Engineer
http://newrelic.com
