Hi Bowen,
There is only one major table taking 90% of writes.
I will try to increase the "commitlog_segment_size_in_mb" value to 1
GB and set "max_mutation_size_in_kb" to 16MB.
Currently we don't set the "commitlog_total_space_in_mb" value, so it
should be using the default of 8192 MB (8 GB). What do you suggest for
this parameter's value?
By the way we don't back up the commit log.
Since our Reaper repair is not working effectively right now, we still
need to rely on hints to replay writes across nodes and DCs. Do you
have any suggestion for how old hints should be before they can be
safely removed without impacting data consistency? I think the answer
may depend on many factors, but I was wondering if there is any rule of thumb?
Thanks,
Jiayong Sun
On Sunday, August 15, 2021, 05:58:11 AM PDT, Bowen Song <bo...@bso.ng>
wrote:
Hi Jiayong,
Based on this statement:
> We see the commit logs switched about 10 times per minute
I'd also like to know, roughly speaking, how many tables in Cassandra
are being written frequently? I'm asking this because the commit log
segments are being created (and recycled) so frequently (~ every 6
seconds), and I suspect that a lot of tables are involved in each
commit log segment, and that leads to many SSTable flushes.
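The "~ every 6 seconds" figure can be checked with a quick back-of-the-envelope calculation. The sketch below uses only numbers quoted in this thread (32 MB segments, about 10 switches per minute) and assumes each segment is completely full when it is switched, which may not hold in practice. The function names are made up for illustration.

```python
# Back-of-the-envelope estimate using figures quoted in this thread.
# Assumption: every segment is filled completely before it is switched.

def seconds_per_segment(switches_per_minute: float) -> float:
    """Average lifetime of one commit log segment, in seconds."""
    return 60.0 / switches_per_minute


def write_rate_mb_per_s(segment_size_mb: float, switches_per_minute: float) -> float:
    """Implied commit log write rate in MB/s."""
    return segment_size_mb * switches_per_minute / 60.0


def segment_lifetime_s(segment_size_mb: float, rate_mb_per_s: float) -> float:
    """How long a segment of the given size would last at a given write rate."""
    return segment_size_mb / rate_mb_per_s


if __name__ == "__main__":
    rate = write_rate_mb_per_s(32, 10)
    print(f"seconds per 32 MB segment: {seconds_per_segment(10):.1f}")
    print(f"implied write rate: {rate:.2f} MB/s")
    print(f"lifetime of a 1024 MB segment: {segment_lifetime_s(1024, rate):.0f} s")
```

Under that assumption, the same write rate would recycle a 1 GB segment roughly every three minutes instead of every six seconds.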
You could try to drain a node, and then remove all commit logs from
the node and increase the "commitlog_segment_size_in_mb" value to
something much larger (say, 1GB), and also increase the
commitlog_total_space_in_mb accordingly on this node, and see if this
helps improve the situation. Note that you may also want to manually
set "max_mutation_size_in_kb" to 16384 (16 MB; the default value is
half of the commit log segment size) to prevent unexpectedly large
mutations from being accepted on this node and then failing on other nodes.
Please also note that this may interfere with some backup tools that
back up the commit log segments.
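To see why pinning the mutation size matters, here is a small sketch of the "half of the segment size" rule stated above. The helper name is hypothetical; it only reproduces the arithmetic, not Cassandra's actual code.

```python
# Illustrates the rule described above: the default maximum mutation size
# is half of the commit log segment size. Helper name is hypothetical.

def default_max_mutation_size_kb(segment_size_mb: int) -> int:
    """Default max mutation size in KB for a given segment size in MB."""
    return segment_size_mb * 1024 // 2


if __name__ == "__main__":
    for segment_mb in (32, 1024):
        kb = default_max_mutation_size_kb(segment_mb)
        print(f"segment {segment_mb} MB -> default max mutation {kb} KB ({kb // 1024} MB)")
```

With 32 MB segments the default works out to 16384 KB (16 MB). With 1 GB segments on one node only, that node's default would grow to 512 MB while the others stay at 16 MB, which is the mismatch the explicit setting avoids.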
In addition to that, if you periodically purge the hints, you are
probably better off just disabling hinted handoff and making sure you
always run repair within gc_grace_seconds.
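As a rough illustration of the "repair within gc_grace_seconds" condition, the sketch below compares a repair cycle length against the stock table default of 864000 seconds (10 days). The default is an assumption here, since the thread does not say what the tables actually use, and the function name is invented for the example.

```python
# Rough check: does one full repair cycle finish inside gc_grace_seconds?
# 864000 s (10 days) is the stock table default; the real value for the
# tables in this cluster is not stated in the thread.
DEFAULT_GC_GRACE_SECONDS = 864_000


def repair_fits_gc_grace(repair_cycle_days: float,
                         gc_grace_seconds: int = DEFAULT_GC_GRACE_SECONDS) -> bool:
    """True if a repair cycle of the given length fits within gc_grace_seconds."""
    return repair_cycle_days * 86_400 <= gc_grace_seconds


if __name__ == "__main__":
    for days in (7, 14, 21):
        print(f"{days}-day repair cycle fits: {repair_fits_gc_grace(days)}")
```

A repair round that takes weeks, as mentioned elsewhere in this thread, would not fit within a 10-day window.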
Cheers,
Bowen
On 14/08/2021 03:33, Jiayong Sun wrote:
Hi Bowen,
Thanks for digging so deep into the source code.
Here are answers to your questions:
* Does your application change the table schema frequently?
Specifically: alter table, drop table and drop keyspace. - No,
neither admins nor apps frequently alter/drop/create table
schemas at run time.
* Do you have the memtable_flush_period_in_ms table property set to
non-zero on any table at all? - No, all tables use the default,
"memtable_flush_period_in_ms = 0".
* Is the timing of frequent small SSTable flushes coincident with
streaming activities? - No, the repair job is paused and we don't see
any streaming in system.log.
* What's your commitlog_segment_size_in_mb and
commitlog_total_space_in_mb in cassandra.yaml and what's your free
space size on the disk where commit log is located? -
"commitlog_segment_size_in_mb: 32". "commitlog_total_space_in_mb"
is not set. The commit logs are on a separate disk and there is no
chance of it filling up. The data disk is about 50% used.
* How fast do you fill up a commit log segment? I.e.: how fast are
you writing to Cassandra? - We see the commit logs switch about
10 times per minute, and lots of hints accumulate on disk and are
replayed constantly. This could be because many nodes are unresponsive
due to this ongoing issue.
* Anything invoking the "nodetool flush" or "nodetool drain"
command? - No, we don't issue these commands except when restarting a
node, either manually or automatically through a monitoring mechanism,
which does not happen frequently.
I wonder if this could be related to the huge amount of hint replay. The
cluster is stressed by heavy writes from Spark jobs and there are
some hot-spot partitions. Huge amounts of hints (e.g. >50GB per node)
are not uncommon in this cluster, especially since this issue has been
causing many nodes to lose gossip. We had to set up a daily
cron job to clear the older hints from disk, but we are not sure if this
would hurt data consistency among nodes and DCs.
Thoughts?
Thanks,
Jiayong Sun
On Friday, August 13, 2021, 03:39:44 PM PDT, Bowen Song <bo...@bso.ng>
wrote:
Hi Jiayong,
I'm sorry to hear that. I did not know many nodes were/are
experiencing the same issue. A bit of digging in the source code indicates
that the log below comes from the ColumnFamilyStore.logFlush() method.
DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:932 -
Enqueuing flush of sstable_activity: 0.408KiB (0%) on-heap,
0.154KiB (0%) off-heap
The ColumnFamilyStore.logFlush() method is a private method and the
only place referencing it is the ColumnFamilyStore.switchMemtable()
method in the same file, which in turn is referenced in two places:
ColumnFamilyStore.reload() and
ColumnFamilyStore.switchMemtableIfCurrent(). Ignoring secondary indexes
and MVs, the former is only called when the node starts up and
on table schema changes, so it's unlikely to be our suspect (unless you
are frequently changing the schema or restarting the node). The latter is
referenced in two methods: ColumnFamilyStore.forceFlush() and
ColumnFamilyStore.FlushLargestColumnFamily.run(). The
ColumnFamilyStore.FlushLargestColumnFamily.run() method is only called
by the MemtableCleanerThread, and we have pretty much ruled that out
in the previous conversations. The forceFlush() method is invoked if
the table property memtable_flush_period_in_ms is set, when Cassandra
is preparing to send or receive files via streaming, when an old
commit log segment is recycled, on "nodetool drain", and again, on
schema changes (drop keyspace/table).
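The call chain described above can be laid out as a small lookup table and walked mechanically, which makes it easier to tick off each candidate trigger. This is only a restatement of the paragraph in Python, not code taken from Cassandra; the dictionary contents and the function name are illustrative.

```python
# Restates the call chain from the paragraph above as data.
# Not Cassandra code; structure and names are illustrative only.

# method -> methods (or threads) that reference it, per the description above
CALLERS = {
    "logFlush": ["switchMemtable"],
    "switchMemtable": ["reload", "switchMemtableIfCurrent"],
    "switchMemtableIfCurrent": ["forceFlush", "FlushLargestColumnFamily.run"],
    "FlushLargestColumnFamily.run": ["MemtableCleanerThread"],
}

# entry point -> external events that invoke it, per the description above
TRIGGERS = {
    "reload": ["node startup", "table schema change"],
    "forceFlush": [
        "memtable_flush_period_in_ms set",
        "streaming (sending/receiving files)",
        "commit log segment recycled",
        "nodetool drain",
        "schema change (drop keyspace/table)",
    ],
    "MemtableCleanerThread": ["memtable cleaner (ruled out earlier in the thread)"],
}


def flush_triggers(method, path=()):
    """Walk callers upward and return (call chain, trigger) pairs."""
    path = path + (method,)
    found = [(" <- ".join(path), trigger) for trigger in TRIGGERS.get(method, [])]
    for caller in CALLERS.get(method, []):
        found.extend(flush_triggers(caller, path))
    return found


if __name__ == "__main__":
    for chain, trigger in flush_triggers("logFlush"):
        print(f"{trigger}: {chain}")
```

Each printed line corresponds to one of the possible causes that the questions below try to confirm or eliminate.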
So, my questions would be:
* Does your application change the table schema frequently?
Specifically: alter table, drop table and drop keyspace.
* Do you have the memtable_flush_period_in_ms table property set to
non-zero on any table at all?
* Is the timing of frequent small SSTable flushes coincident with
streaming activities?
* What's your commitlog_segment_size_in_mb and
commitlog_total_space_in_mb in cassandra.yaml and what's your free
space size on the disk where commit log is located?
* How fast do you fill up a commit log segment? I.e.: how fast are
you writing to Cassandra?
* Anything invoking the "nodetool flush" or "nodetool drain" command?
I hope the above questions will help you find the root cause.
Cheers,
Bowen
On 13/08/2021 22:47, Jiayong Sun wrote:
Hi Bowen,
Many nodes have this issue and some of them have it repeatedly.
Replacing a node by wiping out everything and streaming in SSTables in
good shape would work, but if we don't know the root cause the node
would end up in bad shape again.
Yes, we know that a Reaper repair running for weeks is not good; it is
most likely due to the multiple DCs with large rings. We
are planning to upgrade to a newer version of Reaper to see if that helps.
We do have debug.log turned on but didn't catch anything helpful other
than the constant enqueuing/flushing/deleting of memtables and
SSTables (I listed a few example messages at the beginning of this email
thread).
Thanks for all your thoughts; I really appreciate them.
Thanks,
Jiayong Sun
On Friday, August 13, 2021, 01:36:21 PM PDT, Bowen Song <bo...@bso.ng>
wrote:
Hi Jiayong,
That doesn't really match the situation described in the SO question.
I suspected it was related to repairing a table with MV and large
partitions, but based on the information you've given, I was clearly
wrong.
Partitions of a few hundred MB are not exactly unusual, and I don't see
how that alone could lead to frequent SSTable flushing. A repair session
taking weeks to complete is a bit worrying in terms of performance and
maintainability, but again it should not cause this issue.
Since we don't know the cause, I can see two possible solutions:
either replace the "broken" node, or dig into the logs (remember to
turn on the debug logs) and try to identify the root cause. I
personally would recommend replacing the problematic node as a quick win.
Cheers,
Bowen
On 13/08/2021 20:31, Jiayong Sun wrote:
Hi Bowen,
We do have a Reaper repair job scheduled periodically, and it can take
days or even weeks to complete one round of repair due to the large
number of rings/nodes. However, we have paused the repair since we
started facing this issue.
We do not use MVs in this cluster.
There is one major table taking 95% of disk storage and workload, but its
partition size is around 30 MB. There are a couple of small tables with
a max partition size of several hundred MB, but their total
data size is only a few GB.
Any thoughts?
Thanks,
Jiayong
On Friday, August 13, 2021, 03:32:45 AM PDT, Bowen Song <bo...@bso.ng>
wrote:
Hi Jiayong,
Sorry I didn't make it clear in my previous email. When I commented on
the RAID0 setup, it was only a comment on the RAID0 setup vs JBOD, and
that was not in relation to the SSTable flushing issue. The part of my
previous email after the "On the frequent SSTable flush issue" line is
the part related to the SSTable flushing issue, and those two
questions at the end of it remain valid:
* Did you run repair?
* Do you use materialized views?
and, if I may, I'd also like to add another question:
* Do you have large (> 100 MB) partitions?
Those are the 3 things mentioned in the SO question. I'm trying to
find the connections between the issue you are experiencing and the
issue described in the SO question.
Cheers,
Bowen
On 13/08/2021 01:36, Jiayong Sun wrote:
Hello Bowen,
Thanks for your response.
Yes, we are aware of the RAID0 vs. individual JBOD trade-off, but all
of our clusters use this RAID0 configuration on Azure, and
we only see this issue on this cluster, so it's hard to
attribute the root cause to the disks. This looks more workload related,
and we are seeking feedback here on any other parameters in the yaml
that we could tune for this.
Thanks again,
Jiayong Sun
On Thursday, August 12, 2021, 04:55:51 AM PDT, Bowen Song
<bo...@bso.ng> wrote:
Hello Jiayong,
Using multiple disks in a RAID0 for Cassandra data directory is not
recommended. You will get better fault tolerance and often better
performance too with multiple data directories, one on each disk.
If you stick with RAID0, it's not 4 disks, it's 1 from Cassandra's
point of view, because any read or write operation will have to touch
all 4 member disks. Therefore, 4 flush writers don't make much sense.
On the frequent SSTable flush issue, a quick internet search leads me to:
* an old bug in Cassandra 2.1 - CASSANDRA-8409
<https://issues.apache.org/jira/browse/CASSANDRA-8409> which
shouldn't affect 3.x at all
* a StackOverflow question
<https://stackoverflow.com/questions/61030392/cassandra-node-jvm-hang-during-node-repair-a-table-with-materialized-view>
which may be related
Did you run repair? Do you use materialized views?
Regards,
Bowen
On 11/08/2021 15:58, Jiayong Sun wrote:
Hi Erick,
The nodes have 4 SSDs in RAID0 (1 TB each, but we only use 2.4 TB of
space; current disk usage is about 50%).
Based on the number of disks, we set memtable_flush_writers to 4
instead of the default of 2.
For the following we set:
- max heap size - 31GB
- memtable_heap_space_in_mb (use default)
- memtable_offheap_space_in_mb (use default)
In the logs, we also noticed that the system.sstable_activity table has
hundreds of MB or GB of data and is constantly flushing:
DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:932 -
Enqueuing flush of sstable_activity: 0.293KiB (0%) on-heap, 0.107KiB
(0%) off-heap
DEBUG [NonPeriodicTasks:1] <timestamp> SSTable.java:105 - Deleting
sstable:
/app/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/md-103645-big
DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:1322 -
Flushing largest CFS(Keyspace='system',
ColumnFamily='sstable_activity') to free up room. Used total:
0.06/1.00, live: 0.00/0.00, flushing: 0.02/0.29, this: 0.00/0.00
Thanks,
Jiayong Sun
On Wednesday, August 11, 2021, 12:06:27 AM PDT, Erick Ramirez
<erick.rami...@datastax.com> wrote:
4 flush writers isn't bad since the default is 2. It doesn't make a
difference if you have fast disks (like NVMe SSDs) because only 1
thread gets used.
But if flushes are slow, the work gets distributed to 4 flush writers
so you end up with smaller flush sizes although it's difficult to tell
how tiny the SSTables would be without analysing the logs and overall
performance of your cluster.
Was there a specific reason you decided to bump it up to 4? I'm just
trying to get a sense of why you did it since it might provide some
clues. Out of curiosity, what do you have set for the following?
- max heap size
- memtable_heap_space_in_mb
- memtable_offheap_space_in_mb