Your missing keyspace problem has nothing to do with that bug.

In that case, the same table was created twice within a very short period of time, and I suspect that was done concurrently on two different nodes. The evidence lies in the two CF IDs - bd7200a0156711e88974855d74ee356f and bd750de0156711e8bdc54f7bcdcb851f - which were created at 2018-02-19T11:26:33.898 and 2018-02-19T11:26:33.918 respectively, a gap of merely 20 milliseconds.
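
In case you want to verify those timestamps yourself, here is a sketch of how they can be decoded (the offset constant is the standard difference between the UUID epoch and the Unix epoch). A CF ID is a time-based (version 1) UUID, so its timestamp field encodes the creation time:

    python3 -c '
    import uuid, datetime
    for s in ("bd7200a0156711e88974855d74ee356f",
              "bd750de0156711e8bdc54f7bcdcb851f"):
        u = uuid.UUID(s)
        # u.time counts 100 ns intervals since the UUID epoch (1582-10-15);
        # subtracting 0x01b21dd213814000 rebases it onto the Unix epoch.
        secs = (u.time - 0x01b21dd213814000) / 1e7
        print(s, datetime.datetime.utcfromtimestamp(secs).isoformat())
    '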

TBH, it doesn't sound like a bug to me. Cassandra is eventually consistent by design: two conflicting schema changes on two different nodes at nearly the same time will likely result in schema disagreement, and when Cassandra eventually reaches agreement again it may discard one of the conflicting schema changes, together with all data written to the discarded table/columns. To make sure this doesn't happen to your data, avoid making multiple schema changes to the same keyspace (for CREATE/ALTER/... KEYSPACE) or the same table (for CREATE/ALTER/... TABLE) on two or more Cassandra coordinator nodes within a very short period of time. Instead, send all your schema change queries to the same coordinator node, or if that's not possible, wait at least 30 seconds between two schema changes and make sure you aren't restarting any node at the same time.
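
For example, a minimal sketch of the "single coordinator, wait for agreement" approach (the host name and the ALTER statement are illustrative; "nodetool describecluster" prints one UUID per distinct schema version known to the cluster, so agreement means exactly one match):

    COORDINATOR=node1
    run_ddl() {
        cqlsh "$COORDINATOR" -e "$1"
        # Block until all nodes report the same schema version, i.e.
        # describecluster lists exactly one schema version UUID.
        while [ "$(nodetool -h "$COORDINATOR" describecluster \
                | grep -cE '[0-9a-f]{8}(-[0-9a-f]{4}){3}-[0-9a-f]{12}:')" -ne 1 ]
        do
            sleep 2
        done
    }
    run_ddl "ALTER TABLE mykeyspace.mytable ADD new_column int;"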


On 01/03/2021 14:04, Marco Gasparini wrote:
Actually, I found a lot of .db files in the following directory:
/var/lib/cassandra/data/mykeyspace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable

I also found this:
             2021-03-01 06:08:08,864 INFO  [Native-Transport-Requests-1] MigrationManager.java:542 announceKeyspaceDrop Drop Keyspace 'mykeyspace'

so I think that you, @erick and @bowen, are right. Something dropped the keyspace.

I will try to follow your procedure @bowen, thank you very much!

Do you know what could cause this issue? It seems serious. I found this bug: https://issues.apache.org/jira/browse/CASSANDRA-14957?page=com.atlassian.jira.plugin.system.issuetabpanels%3Aall-tabpanel - maybe they are related...

Thank you @Bowen and @Erick

On Mon, 1 Mar 2021 at 13:39, Bowen Song <bo...@bso.ng.invalid> wrote:

    The warning message indicates the node y.y.y.y went down (or
    became unreachable via the network) before 2021-02-28 05:17:33.
    Is there any chance you can find the log file on that node at
    around or before that time? It may show why that node went down.
    The reason might be irrelevant to the missing keyspace, but it
    is still worth a look in order to prevent the same thing from
    happening again.
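
    For example, a quick way to see what that node logged shortly
    before the failure (a sketch; the path assumes a default package
    install):

        grep -E '2021-02-28 0[45]:' /var/log/cassandra/system.log | less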

    As Erick said, the table's CF ID isn't new, so it's unlikely to
    be a schema synchronization issue. Therefore I also suspect the
    keyspace was accidentally dropped. Cassandra only logs "Drop
    Keyspace 'keyspace_name'" on the node that received the "DROP
    KEYSPACE ..." query, so you may have to search for it in the log
    files of all nodes.
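
    For example (a sketch; the host names and log path are
    illustrative):

        for host in node1 node2 node3; do
            ssh "$host" "grep \"Drop Keyspace\" /var/log/cassandra/system.log*"
        done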

    Assuming the keyspace was dropped but you still have the SSTable
    files, you can recover the data by re-creating the keyspace and
    tables with an identical replication strategy and schema, then
    copying the SSTable files into the corresponding new table
    directories (which have different CF ID suffixes) on the same
    node, and finally running "nodetool refresh ..." or restarting
    the node. Since you don't yet have a full backup, I strongly
    recommend making a backup, and ideally testing a restore onto a
    different cluster, before attempting this.
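
    A minimal sketch of those steps on one node (the DDL placeholders
    and the new CF ID must be filled in with the original definitions;
    the OLD path follows the dropped-table snapshot layout seen in
    this thread):

        # 1. Re-create the keyspace and table with the original
        #    replication strategy and schema.
        cqlsh -e "CREATE KEYSPACE mykeyspace WITH replication = {...};"
        cqlsh -e "CREATE TABLE mykeyspace.mytable (...);"

        # 2. Copy the preserved SSTables into the new table directory
        #    (note the new CF ID suffix created by the CREATE TABLE).
        OLD=/var/lib/cassandra/data/mykeyspace/mytable-2795c0204a2d11e9aba361828766468f/snapshots/dropped-1614575293790-mytable
        NEW=/var/lib/cassandra/data/mykeyspace/mytable-<new_cf_id>
        cp "$OLD"/* "$NEW"/
        chown -R cassandra:cassandra "$NEW"

        # 3. Make Cassandra load the copied SSTables.
        nodetool refresh mykeyspace mytable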


    On 01/03/2021 11:48, Marco Gasparini wrote:
    Here is the previous error:

    2021-02-28 05:17:33,262 WARN NodeConnectionsService.java:165 validateAndConnectIfNeeded failed to connect to node {y.y.y.y}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{9ba2d3ee-bc82-4e76-ae24-9e20eb334c24}{y.y.y.y}{y.y.y.y:9300}{ALIVE}{rack=r1, dc=DC1} (tried [1] times)
    org.elasticsearch.transport.ConnectTransportException: [y.y.y.y][y.y.y.y:9300] connect_timeout[30s]
        at org.elasticsearch.transport.TcpChannel.awaitConnected(TcpChannel.java:163)
        at org.elasticsearch.transport.TcpTransport.openConnection(TcpTransport.java:616)
        at org.elasticsearch.transport.TcpTransport.connectToNode(TcpTransport.java:513)
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:336)
        at org.elasticsearch.transport.TransportService.connectToNode(TransportService.java:323)
        at org.elasticsearch.cluster.NodeConnectionsService.validateAndConnectIfNeeded(NodeConnectionsService.java:156)
        at org.elasticsearch.cluster.NodeConnectionsService$ConnectionChecker.doRun(NodeConnectionsService.java:185)
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672)
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)

    Yes, this node (y.y.y.y) stopped because it ran out of disk space.


    I said "deleted" because I'm not a native English speaker :)
    I usually "remove" snapshots via 'nodetool clearsnapshot' or the
    cassandra-reaper user interface.
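
    For example (a sketch; the snapshot tag and keyspace name are
    illustrative):

        nodetool clearsnapshot -t my_snapshot_tag -- mykeyspace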




    On Mon, 1 Mar 2021 at 12:39, Bowen Song <bo...@bso.ng.invalid> wrote:

        What was the warning? Is it related to the disk failure
        policy? Could you please share the relevant log? You can edit
        it and redact the sensitive information before sharing it.

        Also, I can't help but notice that you used the word "delete"
        (instead of "clear") to describe the process of removing
        snapshots. May I ask how you deleted the snapshots? Was it
        "nodetool clearsnapshot ...", "rm -rf ..." or something else?


        On 01/03/2021 11:27, Marco Gasparini wrote:
        Thanks Bowen for answering.

        Actually, I checked the server log and the only warning was
        that a node went offline.
        No, I have no backups or snapshots.

        In the meantime I found that Cassandra probably moved all the
        files from a directory into the snapshot directory. I am
        pretty sure of that because I recently deleted all the
        snapshots I had made (the disk was running out of space), and
        I found this very directory full of files whose modification
        timestamp matches the first error I got in the log.
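
        (This matches Cassandra's default behaviour: with
        auto_snapshot: true in cassandra.yaml, dropping a table or
        keyspace first snapshots its SSTables. As a quick check, a
        sketch:

            nodetool listsnapshots

        should list tags like "dropped-<timestamp>-<table>".)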



        On Mon, 1 Mar 2021 at 12:13, Bowen Song <bo...@bso.ng.invalid> wrote:

            The first thing I'd check is the server log. It may
            contain vital information about the cause, and there may
            be different ways to recover depending on the cause.

            Also, please allow me to ask a seemingly obvious
            question: do you have a backup?


            On 01/03/2021 09:34, Marco Gasparini wrote:
            hello everybody,

            This morning, Monday!!!, I was checking on the Cassandra
            cluster and noticed that all data was missing. I saw the
            following error on each node (9 nodes in the cluster):

            2021-03-01 09:05:52,984 WARN [MessagingService-Incoming-/x.x.x.x] IncomingTcpConnection.java:103 run UnknownColumnFamilyException reading from socket; closing
            org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for cfId cba90a70-5c46-11e9-9e36-f54fe3235e69. If a table was just created, this is likely due to the schema not being fully propagated. Please wait for schema agreement on table creation.
                    at org.apache.cassandra.config.CFMetaData$Serializer.deserialize(CFMetaData.java:1533)
                    at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:758)
                    at org.apache.cassandra.db.ReadCommand$Serializer.deserialize(ReadCommand.java:697)
                    at org.apache.cassandra.io.ForwardingVersionedSerializer.deserialize(ForwardingVersionedSerializer.java:50)
                    at org.apache.cassandra.net.MessageIn.read(MessageIn.java:123)
                    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessage(IncomingTcpConnection.java:195)
                    at org.apache.cassandra.net.IncomingTcpConnection.receiveMessages(IncomingTcpConnection.java:183)
                    at org.apache.cassandra.net.IncomingTcpConnection.run(IncomingTcpConnection.java:94)
            I tried to query the keyspace and got this:

            node1# cqlsh
            Connected to Cassandra Cluster at x.x.x.x:9042.
            [cqlsh 5.0.1 | Cassandra 3.11.5.1 | CQL spec 3.4.4 |
            Native protocol v4]
            Use HELP for help.
            cqlsh> select * from mykeyspace.mytable where id = 123935;
            InvalidRequest: Error from server: code=2200 [Invalid
            query] message="Keyspace mykeyspace does not exist"
            Investigating each node, I found that all the SSTables
            still exist, so I think the data is still there but the
            keyspace vanished, "magically".

            Other facts I can tell you:

              * I have been getting anticompaction errors from 2
                nodes because their disks were almost full.
              * The cluster was online on Friday.
              * This morning, Monday, the whole cluster was offline
                and I noticed the "missing keyspace" problem.
              * During the weekend the cluster was subject to
                inserts and deletes.
              * It is a 9-node (HDD) Cassandra 3.11 cluster.

            I really need help on this: how can I restore the cluster?

            Thank you very much
            Marco


