Your missing keyspace problem has nothing to do with that bug.
In that case, the same table was created twice within a very short period
of time, and I suspect that was done concurrently on two different nodes.
The evidence lies in the two CF IDs - bd7200a0156711e88974855d74ee356f
and bd750de0156711e8bdc54f7bcdcb851f - which were created at
2018-02-19T11:26:33.898 and 2018-02-19T11:26:33.918 respectively, a gap
of merely 20 milliseconds.
TBH, it doesn't sound like a bug to me. Cassandra is eventually
consistent by design, and two conflicting schema changes made on two
different nodes at nearly the same time will likely result in schema
disagreement. Cassandra will eventually reach agreement again, possibly
discarding one of the conflicting schema changes, together with all data
written to the discarded table/columns. To make sure this doesn't happen
to your data, avoid making multiple schema changes to the same keyspace
(for create/alter/... keyspace) or the same table (for create/alter/...
table) on two or more Cassandra coordinator nodes within a very short
period of time. Instead, send all your schema change queries to the same
coordinator node, or, if that's not possible, wait at least 30 seconds
between two schema changes and make sure you aren't restarting any node
at the same time.
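If you want to check that the schema has settled before sending the next
schema change, you can run "nodetool describecluster" on any node and
look at the "Schema versions" section of the output; all nodes should be
listed under a single schema version before you proceed. For example
(the grep is just a convenience, adjust as needed):

    nodetool describecluster | grep -A 5 'Schema versions'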
On 01/03/2021 14:04, Marco Gasparini wrote:
Actually, I found a lot of .db files in the following directory:
/var/lib/cassandra/data/mykespace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable
I also found this:
2021-03-01 06:08:08,864 INFO [Native-Transport-Requests-1] MigrationManager.java:542 announceKeyspaceDrop Drop Keyspace 'mykeyspace'
So I think that you, @erick and @bowen, are right. Something dropped
the keyspace.
I will try to follow your procedure @bowen, thank you very much!
Do you know what could cause this issue?
It seems like a big issue. I found this bug
https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel,
maybe they are correlated...
Thank you @Bowen and @Erick
On Mon, 1 Mar 2021 at 13:39, Bowen Song
<bo...@bso.ng.invalid> wrote:
The warning message indicates the node y.y.y.y went down (or became
unreachable over the network) before 2021-02-28 05:17:33. Is there
any chance you can find the log file on that node at around or
before that time? It may show why that node went down. The reason
might be unrelated to the missing keyspace, but it's still worth a
look in order to prevent the same thing from happening again.
As Erick said, the table's CF ID isn't new, so it's unlikely to be
a schema synchronization issue. Therefore I also suspect the
keyspace was accidentally dropped. Cassandra only logs "Drop
Keyspace 'keyspace_name'" on the node that received the "DROP
KEYSPACE ..." query, so you may have to search the log files on
every node to find it.
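For example, on each node something like this should find it
(assuming the default log location for a package install - adjust
the path if yours differs):

    grep 'Drop Keyspace' /var/log/cassandra/system.log*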
Assuming the keyspace was dropped but you still have the SSTable
files, you can recover the data by re-creating the keyspace and
tables with an identical replication strategy and schema, copying
the SSTable files into the corresponding new table directories
(which will have different CF ID suffixes) on the same node, and
finally running "nodetool refresh ..." or restarting the node.
Since you don't yet have a full backup, I strongly recommend
making a backup, and ideally test-restoring it to a different
cluster, before attempting this.
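To make the steps above a bit more concrete, here is a rough sketch
for a single table (the keyspace/table names are taken from your
messages; the CREATE statements, CF ID directory names and paths
are placeholders that you'd need to replace with your actual schema
and paths):

    # 1. re-create the keyspace and table with the original schema
    cqlsh -e "CREATE KEYSPACE mykeyspace WITH replication = {...};"
    cqlsh -e "CREATE TABLE mykeyspace.mytable (...);"

    # 2. on each node, copy the old SSTables into the new table directory
    cp /var/lib/cassandra/data/mykeyspace/mytable-<old_cfid>/snapshots/dropped-<timestamp>-mytable/* \
       /var/lib/cassandra/data/mykeyspace/mytable-<new_cfid>/

    # 3. load the copied SSTables without a restart
    nodetool refresh mykeyspace mytable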
On 01/03/2021 11:48, Marco Gasparini wrote:
Here is the previous error:
2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165 validateAndConnectIfNeeded failed to connect to node {y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
    at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
    at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
    at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
    at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
    at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
    at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Yes, this node (y.y.y.y) stopped because it ran out of disk space.
I said "deleted" because I'm not a native English speaker :)
I usually "remove" snapshots via 'nodetool clearsnapshot' or the
cassandra-reaper user interface.
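(For example, something like "nodetool clearsnapshot -t <snapshot_name>
-- mykeyspace" for a specific snapshot, or just "nodetool clearsnapshot"
to remove them all - the snapshot name here is only a placeholder.)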
On Mon, 1 Mar 2021 at 12:39, Bowen Song
<bo...@bso.ng.invalid> wrote:
What was the warning? Is it related to the disk failure
policy? Could you please share the relevant log? You can edit
it and redact the sensitive information before sharing it.
Also, I can't help noticing that you used the word "delete"
(instead of "clear") to describe the process of removing
snapshots. May I ask how you deleted the snapshots? Was it
"nodetool clearsnapshot ...", "rm -rf ..." or something else?
On 01/03/2021 11:27, Marco Gasparini wrote:
thanks Bowen for answering
Actually, I checked the server log and the only warning was
that a node went offline.
No, I have no backups or snapshots.
In the meantime I found that Cassandra probably moved all the
files from a table directory to the snapshot directory. I'm
pretty sure of that because I recently deleted all the snapshots
I had made (the node was running out of disk space), and I then
found this very directory full of files whose modification
timestamp matches the first error I got in the log.
On Mon, 1 Mar 2021 at 12:13, Bowen Song
<bo...@bso.ng.invalid> wrote:
The first thing I'd check is the server log. The log may
contain vital information about the cause, and there may be
different ways to recover depending on that cause.
Also, please allow me to ask a seemingly obvious
question: do you have a backup?
On 01/03/2021 09:34, Marco Gasparini wrote:
Hello everybody,
This morning, Monday!!!, I was checking on the Cassandra
cluster and noticed that all the data was missing. I saw
the following error on each node (9 nodes in the cluster):
2021-03-01 09:05:52,984 WARN [MessagingService-Incoming-/x.x.x.x] IncomingTcpConnection.java:103 run UnknownColumnFamilyException reading from socket; closing
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
    at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
    at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
    at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
    at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
    at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
I tried to query the keyspace and got this:
node1# cqlsh
Connected to Cassandra Cluster at x.x.x.x:9042.
[cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 |
Native protocol v4]
Use HELP for help.
cqlsh> select * from mykeyspace.mytable where id = 123935;
InvalidRequest: Error from server: code=2200 [Invalid
query] message="Keyspace mykeyspace does not exist"
Investigating on each node, I found that all the
SSTables exist, so I think the data is still there but
the keyspace vanished, "magically".
Other facts I can tell you:
* I have been getting anticompaction errors from 2
nodes because their disks were almost full.
* The cluster was online on Friday.
* This morning, Monday, the whole cluster was offline
and I noticed the "missing keyspace" problem.
* During the weekend the cluster was subject to
inserts and deletes.
* It is a 9-node (HDD) Cassandra 3.11 cluster.
I really need help on this, how can I restore the cluster?
Thank you very much
Marco