Re: [External]Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
Hi Jiri, Thank you for taking a look at this issue. But I'm sorry, I don't really understand your message. Can you please elaborate? Cheers, Bowen On 05/11/2024 12:34, Jiri Steuer (EIT) wrote: Hi all, Is it possible to easily check the moment/milestone when the data across multiple data centers

RE: [External]Unexplained stuck memtable flush

2024-11-05 Thread Jiri Steuer (EIT)
Of course, let me explain the situation. I have a general question that is not directly related to the “Unexplained stuck memtable flush” problem. I would like to know how I can identify the point at which all nodes across all data centers are in sync. * It is a little tricky to wait e.g. 1 day, 2 d

Re: [External]Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
If it is not related to the memtable flush issue, can you please post in a different mailing list thread instead? If you reply to this thread, everyone reading it will initially assume it is somehow related, which is neither good for them (wasting their time trying to understand it) nor for you (your

Re: Migration Cassandra to a new data center

2024-11-05 Thread edi mari
Each physical data center corresponds to a "logical" Cassandra DC (a group of nodes). In our situation, we need to move one of our physical data centers (i.e., the server rooms) to a new location, which will involve an extended period of downtime. Thanks Edi On Tue, Nov 5, 2024 at 1:27 PM Bowen S

Re: Migration Cassandra to a new data center

2024-11-05 Thread Bowen Song via user
From the way you wrote this, I suspect the name DC may have a different meaning here. Are you talking about the physical location (i.e. server rooms), or the Cassandra DC (i.e. group of nodes for replication purposes)? On 05/11/2024 11:01, edi mari wrote: Hello, We have a Cassandra cluster deploy

Re: Migration Cassandra to a new data center

2024-11-05 Thread Bowen Song via user
You just confirmed my suspicion. You are indeed using the same term here for both the physical location of the servers and the logical Cassandra DC. The questions are about the procedure for migrating the server hardware to a new location, not the Cassandra DC. Assuming that the IP addre

RE: [External]Unexplained stuck memtable flush

2024-11-05 Thread Jiri Steuer (EIT)
Hi all, Is it possible to easily check the moment/milestone when the data across multiple data centers will be in sync (assuming that other applications and user access are disabled)? I am thinking about monitoring throughput or …? Thanks for the feedback, J. Steuer This item's classification is Internal

Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
Hi all, We have a cluster running Cassandra 4.1.1. We are seeing the memtable flush randomly getting stuck. This has happened twice in the last 10 days, to two different nodes in the same cluster. This started to happen after we enabled CDC, and each time it got stuck, there was at least one

Re: Unexplained stuck memtable flush

2024-11-05 Thread Dmitry Konstantinov
Hi Bowen, would it be possible to share a full thread dump? Regards, Dmitry On Tue, 5 Nov 2024 at 12:12, Bowen Song via user wrote: > Hi all, > > We have a cluster running Cassandra 4.1.1. We are seeing the memtable > flush randomly getting stuck. This has happened twice in the last 10 days, >

Re: Migration Cassandra to a new data center

2024-11-05 Thread edi mari
Thank you for your reply, Bowen. Correct, the questions were about migrating the server hardware to a new location, not the Cassandra DC. Wouldn’t it be a good idea to use the hints to complete the data to DC3? I'll extend the hint window (e.g., to one week) and allow the other data centers (DC1 a

Re: Unexplained stuck memtable flush

2024-11-05 Thread Dmitry Konstantinov
I am speaking about a thread dump (stack traces for all threads), not a heap dump. The heap dump should contain thread stacks info. Thread dump (stack traces) is small and does not have sensitive info. Regards, Dmitry On Tue, 5 Nov 2024 at 13:53, Bowen Song via user wrote: > It's about 18GB in

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
It's about 18GB in size and may contain a huge amount of sensitive data (e.g. all the pending writes), so I can't share it. However, if there's any particular piece of information you would like to have, I'm more than happy to extract the info from the dump and share it here. On 05/11/2024

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
Sorry, I must have misread it. The full thread dump is attached. I compressed it with gzip because the text file is over 1 MB in size. On 05/11/2024 14:04, Dmitry Konstantinov wrote: I am speaking about a thread dump (stack traces for all threads), not a heap dump. The heap dump should contain

Re: Migration Cassandra to a new data center

2024-11-05 Thread Bowen Song via user
Hinted hand off is a best effort approach, and relying on it alone is a bad idea. Hints can get lost due to a number of reasons, such as getting too old or too big, or the node storing the hints dies. You should rely on regular repair to guarantee the correctness of the data. You may use hinted

Re: Unexplained stuck memtable flush

2024-11-05 Thread Jon Haddad
I ran into this a few months ago, and in my case I tracked it down to an issue with ZFS not unlinking commitlogs properly. https://issues.apache.org/jira/browse/CASSANDRA-19564 On Tue, Nov 5, 2024 at 6:05 AM Dmitry Konstantinov wrote: > I am speaking about a thread dump (stack traces for all th

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
Hi Jon, That is interesting. We happen to be running Cassandra on ZFS. However, we have not had any incident for years with this setup; the only change is the recent addition of CDC. I can see that in CASSANDRA-19564, the MemtablePostFlush thread was stuck on the unlink() syscall. But in our

Re: Unexplained stuck memtable flush

2024-11-05 Thread Jon Haddad
Yeah, I looked through your stack trace and saw it wasn't the same thing, but the steps to identify the root cause should be the same. I nuked ZFS from orbit :) This was happening across all the machines at various times in the cluster, and we haven't seen a single issue since switching to XFS.
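
When looking through a full thread dump like the one discussed above, it can help to pull out only the memtable-related threads first. The following is a minimal sketch, assuming the dump is in the usual HotSpot jstack-style text layout (a quoted thread name on a header line, stack lines below it, a blank line between threads). The function names, the SAMPLE text and the stack frame in it are hypothetical and only for illustration; the thread name prefixes are the ones mentioned in these threads.

```python
import re

# A thread header in jstack-style output starts with the quoted thread name.
HEADER = re.compile(r'^"(?P<name>[^"]+)"')

# Hypothetical sample, NOT taken from the attached dump.
SAMPLE = '''\
"MemtablePostFlush:1" #101 daemon prio=5 os_prio=0 tid=0x1 nid=0x2 waiting on condition
   java.lang.Thread.State: WAITING (parking)
\tat example.Hypothetical.await(Hypothetical.java:1)

"main" #1 prio=5 os_prio=0 tid=0x3 nid=0x4 runnable
   java.lang.Thread.State: RUNNABLE
'''


def parse_thread_dump(text):
    """Split a jstack-style dump into {thread name: [stack lines]}."""
    threads = {}
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            # New thread section begins.
            current = match.group("name")
            threads[current] = []
        elif current is not None:
            stripped = line.strip()
            if stripped:
                threads[current].append(stripped)
            else:
                # A blank line ends the current thread section.
                current = None
    return threads


def memtable_threads(threads, prefixes=("MemtablePostFlush", "MemtableFlushWriter")):
    """Keep only the threads whose name starts with one of the prefixes."""
    return {name: stack for name, stack in threads.items()
            if name.startswith(prefixes)}


if __name__ == "__main__":
    for name, stack in memtable_threads(parse_thread_dump(SAMPLE)).items():
        print(name)
        for frame in stack:
            print("    " + frame)
```

To use it on a real dump, read the decompressed text file and pass its contents to parse_thread_dump instead of SAMPLE. The filter only narrows down where to look; it does not say why a thread is blocked.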

Re: Unexplained stuck memtable flush

2024-11-05 Thread Jeff Jirsa
> On Nov 5, 2024, at 4:12 AM, Bowen Song via user > wrote: > > Writes on this node start to time out and fail. But if left untouched, it's > only gonna get worse, and eventually lead to JVM OOM and crash. > > By inspecting the heap dump created at OOM, we can see that both of the > Memtable

Migration Cassandra to a new data center

2024-11-05 Thread edi mari
Hello, We have a Cassandra cluster deployed across three different data centers, with each data center (DC1, DC2, and DC3) hosting 50 Cassandra nodes. We are currently saving one replica in each data center. We plan to migrate DC3, including storage and servers, to a new data center. 1. What woul

Re: Upgrade from 4 to 5 issue

2024-11-05 Thread Joe Obernberger
Found issue - num tokens was set incorrectly in my container. Upgrade successful! -Joe On 11/5/2024 2:27 PM, Joe Obernberger wrote: Hi all - getting an error trying to upgrade our 4.x cluster to 5.  The following message repeats over and over and then the pod crashes: Heap dump creation on u

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
Funnily enough, we used to run on ext4 and XFS on mdarray RAID1, but the crappy disks we had (and still have) were randomly spitting out garbage data every once in a while. We suspected it was a firmware bug but were unable to confirm or reliably reproduce it. Other than this behaviour, those disks work fine

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
I will give it a try and see what I can find. I plan to go down the rabbit hole tomorrow. Will keep you updated. On 05/11/2024 17:34, Jeff Jirsa wrote: On Nov 5, 2024, at 4:12 AM, Bowen Song via user wrote: Writes on this node start to time out and fail. But if left untouched, it's only

Upgrade from 4 to 5 issue

2024-11-05 Thread Joe Obernberger
Hi all - getting an error trying to upgrade our 4.x cluster to 5.  The following message repeats over and over and then the pod crashes: Heap dump creation on uncaught exceptions is disabled. DEBUG [MemtableFlushWriter:2] 2024-11-05 19:25:12,763 ColumnFamilyStore.java:1379 - Flu