Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
23, 2024 at 10:45 AM Bowen Song via user wrote: I suspect you are abusing batch statements. Batch statements should only be used where atomicity or isolation is needed. Using batch statements won't make inserting multiple partitions faster. In fact, it often will make that s

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
what it's worth, I do see 100% CPU utilization in every single one of these tests. On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user wrote: Have you checked the thread CPU utilisation of the client side? You likely will need more than one thread to do insertion in a loo

Re: Trouble with using group commitlog_sync

2024-04-24 Thread Bowen Song via user
n Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user wrote: To achieve 10k loop iterations per second, each iteration must take 0.1 milliseconds or less. Considering that each iteration needs to lock and unlock the semaphore (two syscalls) and make network requests (more sy
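
The arithmetic in the message above can be checked with a short sketch. The function name `per_iteration_budget_ms` is hypothetical; the 10k iterations per second target is the only figure taken from the thread.

```python
def per_iteration_budget_ms(iterations_per_second: float) -> float:
    """Return the maximum time, in milliseconds, that one iteration may take
    for a single-threaded loop to sustain the given rate."""
    if iterations_per_second <= 0:
        raise ValueError("iterations_per_second must be positive")
    return 1000.0 / iterations_per_second


# 10k iterations per second is the target discussed in the thread.
budget = per_iteration_budget_ms(10_000)
print(f"{budget} ms per iteration")
```

Everything a single iteration does (locking, unlocking, network requests) has to fit inside that budget, otherwise the work has to be spread over more than one thread or client.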

Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Bowen Song via user
Hi Paul, You don't need to plan for or introduce an outage for a rolling upgrade, which is the preferred route. It isn't advisable to take down an entire DC to do an upgrade. You should aim to complete upgrading the entire cluster and finish a full repair within the shortest gc_grace_seconds (d

Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Bowen Song via user
about having a schema mismatch for this long time. Should I be concerned, or have others upgraded in a similar way? Thanks Paul On 24 Apr 2024, at 17:02, Bowen Song via user wrote: Hi Paul, You don't need to plan for or introduce an outage for a rolling upgrade, which is the preferred

Re: Trouble with using group commitlog_sync

2024-04-24 Thread Bowen Song via user
, Apr 23, 2024 at 10:24 PM Bowen Song via user wrote: You might have run into the bottleneck of the driver's IO thread. Try increasing the driver's connections-per-server limit to 2 or 3 if you've only got 1 server in the cluster. Or alternatively, run two clie

Re: compaction trigger after every fix interval

2024-04-28 Thread Bowen Song via user
There are many things that can trigger a compaction; knowing the type of compaction can help narrow it down. Have you looked at the nodetool compactionstats command output when it is happening? What is the compaction type? It can be "compaction", but can also be something else, such as "validati

Re: Change num_tokens in a live cluster

2024-05-16 Thread Bowen Song via user
You can also add a new DC with the desired number of nodes and num_tokens on each node with auto bootstrap disabled, then rebuild the new DC from the existing DC before decommissioning the existing DC. This method only needs to copy data once, and can copy from/to multiple nodes concurrently, ther

Re: Change num_tokens in a live cluster

2024-05-16 Thread Bowen Song via user
data need to be moved? On 16/05/2024 15:54, Gábor Auth wrote: Hi, On Thu, 16 May 2024, 10:37 Bowen Song via user, wrote: You can also add a new DC with the desired number of nodes and num_tokens on each node with auto bootstrap disabled, then rebuild the new DC from the existing

Re: TWCS Log Warning

2024-05-23 Thread Bowen Song via user
As the log level name "DEBUG" suggests, these are debug messages, not warnings. Is there any reason that made you believe that these messages are warnings? On 23/05/2024 11:10, Isaeed Mohanna wrote: Hi I have a big table (~220GB reported by used space live by tablestats) with time series data

Re: Bootstrap error - Cassandra 4.1.5

2024-08-14 Thread Bowen Song via user
It looks like all your nodes are in the same DC and the same rack with 256 vnodes each. It's very hard (if not impossible) to add multiple nodes to the same DC concurrently and safely in this setup. You are better off adding one node at a time to this cluster. Try searching for "ERROR" in the log

Re: Bootstrap error - Cassandra 4.1.5

2024-08-15 Thread Bowen Song via user
p for storage.  I'd love a way to double the number of nodes, but sounds like I shouldn't have let it get this far.  We're having some odd performance issues on reads, that I'm diagnosing. -Joe On 8/14/2024 5:07 PM, Bowen Song via user wrote: It looks like all your nodes are in

Re: Bootstrap error - Cassandra 4.1.5

2024-08-15 Thread Bowen Song via user
   99% Any ideas? Thank you! -Joe On 8/15/2024 9:03 AM, Bowen Song via user wrote: You may need to look at the zipped log files if the streaming had been running for a while before failing. The error could have happened hours or days before the final failure. If your cluster is al

Re: Cassandra Inbound Error Message

2024-08-27 Thread Bowen Song via user
Hello Edi, Before attempting to prematurely optimise, let's try to understand the situation a bit better. * What's the bandwidth available? (think: total bandwidth and the typical usage) * What's causing the heavy network load? * How much bandwidth is consumed by the heavy network load? * How l

Re: Cassandra Inbound Error Message

2024-08-29 Thread Bowen Song via user
vent these errors? Where can I find more information about these errors, and under what circumstances do these messages appear? Additionally, what does the term "SMALL_MESSAGES" mean in the error message? Edi On Tue, Aug 27, 2024 at 8:04 PM Bowen Song via user wrote: Hello

Re: Tombstone Generation in Cassandra 4.1.3 Despite No Update/Delete Operations

2024-10-24 Thread Bowen Song via user
Is one tombstone scanned per query causing any issue? I mean real issues, not the scanning of tombstone itself. On 24/10/2024 04:56, Naman kaushik wrote: Thanks everyone for your responses. We have columns with |list| and |list| types, and after using |sstabledump|, we found that the tombston

Re: Cross-Node Latency Issues

2024-10-24 Thread Bowen Song via user
tup. Regards, Ashish On Thu, Oct 24, 2024 at 6:32 PM Bowen Song via user wrote: Can you be more explicit about the "latency metrics from Grafana" you looked at? What percentile latencies were you looking at? Any aggregation used? You can post the underlying queries used for the

Re: Cross-Node Latency Issues

2024-10-24 Thread Bowen Song via user
Can you be more explicit about the "latency metrics from Grafana" you looked at? What percentile latencies were you looking at? Any aggregation used? You can post the underlying queries used for the dashboard if that's easier than explaining it. In general you should only care about the max, no

Re: [External]Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
outside of the organization. From:* Bowen Song via user *Sent:* Tuesday, November 5, 2024 1:12 PM *To:* d...@cassandra.apache.org; user@cassandra.apache.org *Cc:* Bowen Song *Subject:* [External]Unexplained stuck memtable flush This message is from an EXTERNAL SENDER - be CAUTIOUS, particularly

Re: [External]Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
half of new nodes) will be finished from the perspective of data sync? Thanks for sharing your best practices, regards     Jiri * * * This item's classification is Internal. It was created by and is in property of the EmbedIT. Do not distribute outside of the organization. From:* Bowen Song via

Re: Migration Cassandra to a new data center

2024-11-05 Thread Bowen Song via user
From the way you wrote this, I suspect the name DC may have a different meaning here. Are you talking about the physical location (i.e. server rooms), or the Cassandra DC (i.e. group of nodes for replication purposes)? On 05/11/2024 11:01, edi mari wrote: Hello, We have a Cassandra cluster deploy

Re: Migration Cassandra to a new data center

2024-11-05 Thread Bowen Song via user
nks Edi On Tue, Nov 5, 2024 at 1:27 PM Bowen Song via user wrote:  From the way you wrote this, I suspect the name DC may have a different meaning here. Are you talking about the physical location (i.e. server rooms), or the Cassandra DC (i.e. group of nodes for replication p

Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
Hi all, We have a cluster running Cassandra 4.1.1. We are seeing the memtable flush randomly getting stuck. This has happened twice in the last 10 days, to two different nodes in the same cluster. This started to happen after we enabled CDC, and each time it got stuck, there was at least one

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
it here. On 05/11/2024 13:01, Dmitry Konstantinov wrote: Hi Bowen, would it be possible to share a full thread dump? Regards, Dmitry On Tue, 5 Nov 2024 at 12:12, Bowen Song via user wrote: Hi all, We have a cluster running Cassandra 4.1.1. We are seeing the memtable flush randomly

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
contain thread stacks info. Thread dump (stack traces) is small and does not have sensitive info. Regards, Dmitry On Tue, 5 Nov 2024 at 13:53, Bowen Song via user wrote: It's about 18GB in size and may contain a huge amount of sensitive data (e.g. all the pending writes), so I can'

Re: Migration Cassandra to a new data center

2024-11-05 Thread Bowen Song via user
DC3? I'll extend the hint window (e.g., to one week) and allow the other data centers (DC1 and DC2) to save hints for DC3. Then, when DC3 returns online, it can receive and process the hints. Edi On Tue, Nov 5, 2024 at 2:34 PM Bowen Song via user wrote: You just confirmed my susp

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
p (stack traces) is small and does not have sensitive info. Regards, Dmitry On Tue, 5 Nov 2024 at 13:53, Bowen Song via user wrote: It's about 18GB in size and may contain a huge amount of sensitive data (e.g. all the pending writes), so I can't share

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
e cluster, and we haven't seen a single issue since switching to XFS. Thanks for the advice though, I'll keep it in mind if I encounter it again. Jon On Tue, Nov 5, 2024 at 9:18 AM Bowen Song via user wrote: Hi Jon, That is interesting. We happen to be running Cassandra

Re: Unexplained stuck memtable flush

2024-11-05 Thread Bowen Song via user
I will give it a try and see what I can find. I plan to go down the rabbit hole tomorrow. Will keep you updated. On 05/11/2024 17:34, Jeff Jirsa wrote: On Nov 5, 2024, at 4:12 AM, Bowen Song via user wrote: Writes on this node start to time out and fail. But if left untouched, it's

Re: Unexplained stuck memtable flush

2024-11-08 Thread Bowen Song via user
atch them properly by itself) --- How many CommitLogSegment objects do you have in your heap dump? What are values for the following fields of CommitLogSegment objects? lastSyncedOffset lastMarkerOffset cdcState Do you have CDC index files written by org.apache.cassandra.db.commitlog.CommitLogSegmen

Re: Unexplained stuck memtable flush

2024-11-12 Thread Bowen Song via user
ndra/db/ReadExecutionController.java#L141C77-L141C97> /            indexController = new ReadExecutionController(command, indexCfs.readOrdering.start(), indexCfs.metadata(), null, null, NO_SAMPLING, false);/ If "/indexCfs.readOrdering.start()/" succeeded but the constructor

Re: CDC and schema disagreement

2024-09-24 Thread Bowen Song via user
Thank you for reporting this. I may check next week more closely and let you know. On Fri, Sep 20, 2024 at 5:43 PM Bowen Song via user wrote: Hi all, I suspect that I've run into a bug (or two). On Cassandra 4.1.1, when `cdc_enabled` in the cassandra.yaml file is set

Re: Recommend Cassandra consultant

2024-09-27 Thread Bowen Song via user
Hello Jeff, I'm not a consultant, but do have some experience in troubleshooting this type of issue. The first thing in troubleshooting is gathering information. You don't want to troubleshoot issues blindly. Some (but not all) important pieces of information are CPU usage, network IO, disk IO, JVM

CDC and schema disagreement

2024-09-20 Thread Bowen Song via user
Hi all, I suspect that I've run into a bug (or two). On Cassandra 4.1.1, when `cdc_enabled` in the cassandra.yaml file is set to `false` on at least one node in the cluster, and then the `ALTER TABLE ... WITH cdc=...` statement was run against that node, the cluster will end up in the schema

Re: Upgrading to Cassandra 5.0

2024-10-03 Thread Bowen Song via user
The supported and recommended route for upgrading 3.x to 5.x is to upgrade from 3.x to 4.x first, and then from 4.x to 5.x. Even if you've tested upgrading from 3.x to 5.x directly and it worked in a test environment, it is still unsupported and not recommended. That's because you may overlook s

Re: Unexplained stuck memtable flush

2024-11-07 Thread Bowen Song via user
as a wall clock time? I've found that the syncComplete.queue is empty, meaning the WaitQueue object believes that there's nothing waiting for the signal, yet the "read-hotness-tracker:1" thread is clearly waiting for it. On 06/11/2024 13:49, Bowen Song via user wrote: I

Re: Unexplained stuck memtable flush

2024-11-06 Thread Bowen Song via user
ck on the signal.awaitUninterruptibly() Now I know what is blocking the memtable flushing, but what I haven't been able to figure out is why it got stuck on waiting for that signal. I would appreciate it if anyone can offer some insight here. On 05/11/2024 17:48, Bowen Song via user wrote: I

Re: Unexplained stuck memtable flush

2024-11-06 Thread Bowen Song via user
06/11/2024 18:36, Bowen Song via user wrote: I can see some similarities and some differences between your thread dump and ours. In your thread dump: * no MemtableFlushWriter thread * the MemtablePostFlush thread is idle * the MemtableReclaimMemory thread is waiting for a barrier, possib

Re: Unexplained stuck memtable flush

2024-11-06 Thread Bowen Song via user
onController(command, indexCfs.readOrdering.start(), indexCfs.metadata(), null, null, NO_SAMPLING, false);/ If "/indexCfs.readOrdering.start()/" succeeded but the constructor "/new ReadExecutionController/" failed, then we are not closing "/indexCfs.readOrdering/", which me
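
The failure mode described in the message above (a resource is started, then the constructor that should take ownership of it throws) can be illustrated with a minimal sketch. This is not Cassandra code: it is written in Python rather than Java, and `Ordering`, `Controller`, `leaky` and `safe` are hypothetical names used only to show the general pattern.

```python
class Ordering:
    """Hypothetical stand-in for a resource that counts open operations."""

    def __init__(self) -> None:
        self.open_ops = 0

    def start(self) -> "Ordering":
        self.open_ops += 1
        return self

    def close(self) -> None:
        self.open_ops -= 1


class Controller:
    """Hypothetical owner that closes the operation when it is itself closed."""

    def __init__(self, op: Ordering, fail: bool = False) -> None:
        if fail:
            raise RuntimeError("constructor failed")
        self.op = op

    def close(self) -> None:
        self.op.close()


def leaky(ordering: Ordering, fail: bool) -> Controller:
    # start() runs first; if the constructor raises, nobody calls close().
    return Controller(ordering.start(), fail)


def safe(ordering: Ordering, fail: bool) -> Controller:
    op = ordering.start()
    try:
        return Controller(op, fail)
    except BaseException:
        op.close()  # release the operation the constructor never took over
        raise


leaked, guarded = Ordering(), Ordering()
for build, ordering in ((leaky, leaked), (safe, guarded)):
    try:
        build(ordering, fail=True)
    except RuntimeError:
        pass

print(leaked.open_ops, guarded.open_ops)
```

In the leaky variant the started operation stays open after the failed construction, which is the kind of unclosed operation the message is concerned about; the guarded variant closes it before re-raising.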

Re: Unexplained stuck memtable flush

2024-11-13 Thread Bowen Song via user
It's interesting how they organised the documentation. So it is guaranteed that the ConcurrentLinkedQueue can be modified and won't break the iterator. But I don't see anything mentioning the reverse. Can an iterator removing items from the middle of a queue (which by definition is FIFO) bre

Re: Increased Disk Usage After Upgrading From Cassandra 3.x.x to 4.1.3

2025-03-14 Thread Bowen Song via user
A few suspects: * snapshots, which could've been created automatically, such as by dropping or truncating tables when auto_snapshots is set to true, or compaction when snapshot_before_compaction is set to true * backups, which could've been created automatically, e.g. when incremental_backup

Re: Cassandra Memory Spikes - Tuning Suggestions?

2025-02-26 Thread Bowen Song via user
Hi vignesh, Correlation does not imply causation. I wouldn't work on the assumption that the memory usage spikes are caused by compactions to start with. It's best to prove the causal effect first. There are multiple ways to do this; I'm just throwing in some ideas: 1. taking a heap dump whil

Re: Issue replacing a dead node

2025-05-16 Thread Bowen Song via user
n the old node returning to DN state. What is `nodetool bootstrap resume` going to do? Is there a risk to running resume when the replacement node is no longer in the cluster? Could too high of a tombstone ratio cause this? On 5/15/25 5:08 PM, Bowen Song via user wrote: The dead node being

Re: Issue replacing a dead node

2025-05-15 Thread Bowen Song via user
The dead node being replaced went back to DN state, indicating that the new replacement node failed to join the cluster, usually because the streaming was interrupted (e.g. by network issues, or long STW GC pauses). I would start looking for red flags in the logs, including Cassandra's logs, GC logs,
