Hi Jiayong,
That doesn't really match the situation described in the SO question. I
suspected it was related to repairing a table with MV and large
partitions, but based on the information you've given, I was clearly wrong.
Partitions of a few hundred MB are not exactly unusual, and I don't see
how that alone could lead to frequent SSTable flushing. A repair session
that takes weeks to complete is a bit worrying in terms of performance
and maintainability, but again, it should not cause this issue.
Since we don't know the cause, I can see two possible solutions: either
replace the "broken" node, or dig into the logs (remember to turn on
debug logging) and try to identify the root cause. I personally would
recommend replacing the problematic node as a quick win.
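For what it's worth, a rough sketch of both options (commands run on the
affected node; the replace address is a placeholder you'd fill in):

    # Option 1: raise logging to DEBUG at runtime, then revert once you have enough logs
    nodetool setlogginglevel org.apache.cassandra DEBUG
    # ... reproduce / observe the frequent flushes, then:
    nodetool setlogginglevel org.apache.cassandra INFO

    # Option 2: replace the node by bootstrapping a fresh one with the replace flag,
    # set in jvm.options or cassandra-env.sh on the new node:
    #   -Dcassandra.replace_address_first_boot=<ip_of_node_being_replaced>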
Cheers,
Bowen
On 13/08/2021 20:31, Jiayong Sun wrote:
Hi Bowen,
We do have a Reaper repair job scheduled periodically, and it can take
days or even weeks to complete one round of repair due to the large
number of rings/nodes. However, we have paused the repair since we
started facing this issue.
We do not use MVs in this cluster.
There is a major table taking 95% of the disk storage and workload, but
its partition size is around 30 MB. There are a couple of small tables
with a max partition size of several hundred MB, but their total data
size is just a few GB.
Any thoughts?
Thanks,
Jiayong
On Friday, August 13, 2021, 03:32:45 AM PDT, Bowen Song <bo...@bso.ng>
wrote:
Hi Jiayong,
Sorry, I didn't make it clear in my previous email. When I commented on
the RAID0 setup, it was only a comment on RAID0 vs JBOD, not in relation
to the SSTable flushing issue. The part of my previous email after the
"On the frequent SSTable flush issue" line is the part related to the
SSTable flushing issue, and those two questions at the end of it remain
valid:
* Did you run repair?
* Do you use materialized views?
and, if I may, I'd also like to add another question:
* Do you have large (> 100 MB) partitions?
Those are the 3 things mentioned in the SO question. I'm trying to
find the connections between the issue you are experiencing and the
issue described in the SO question.
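In case it helps, here is one way to check the last two (the keyspace
and table names below are placeholders):

    # any materialized views defined?
    cqlsh -e "SELECT keyspace_name, view_name FROM system_schema.views;"
    # partition size distribution (including max) for a given table
    nodetool tablehistograms <keyspace> <table>
    nodetool tablestats <keyspace>.<table> | grep -i 'partition maximum'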
Cheers,
Bowen
On 13/08/2021 01:36, Jiayong Sun wrote:
Hello Bowen,
Thanks for your response.
Yes, we are aware of the RAID0 vs individual JBOD trade-off, but all of
our clusters use this RAID0 configuration through Azure, and we only see
this issue on this one cluster, so it's hard to conclude that the root
cause is the disks. This looks more workload related, and we are seeking
feedback here on any other parameters in the yaml that we could tune for
this.
Thanks again,
Jiayong Sun
On Thursday, August 12, 2021, 04:55:51 AM PDT, Bowen Song
<bo...@bso.ng> wrote:
Hello Jiayong,
Using multiple disks in a RAID0 for the Cassandra data directory is not
recommended. You will get better fault tolerance, and often better
performance too, with multiple data directories, one on each disk.
If you stick with RAID0, it's not 4 disks but 1 from Cassandra's point
of view, because any read or write operation will have to touch all 4
member disks. Therefore, 4 flush writers doesn't make much sense.
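For illustration, a JBOD-style setup would list one data directory per
disk in cassandra.yaml; a quick way to see the current layout (the yaml
path and mount points below are just examples):

    # show how the data directories are currently configured
    grep -A 6 'data_file_directories' /etc/cassandra/cassandra.yaml
    # a JBOD layout would look roughly like:
    #   data_file_directories:
    #       - /mnt/disk1/cassandra/data
    #       - /mnt/disk2/cassandra/data
    #       - /mnt/disk3/cassandra/data
    #       - /mnt/disk4/cassandra/data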
On the frequent SSTable flush issue, a quick internet search leads me to:
* an old bug in Cassandra 2.1 - CASSANDRA-8409
<https://issues.apache.org/jira/browse/CASSANDRA-8409> which
shouldn't affect 3.x at all
* a StackOverflow question
<https://stackoverflow.com/questions/61030392/cassandra-node-jvm-hang-during-node-repair-a-table-with-materialized-view>
which may be related
Did you run repair? Do you use materialized views?
Regards,
Bowen
On 11/08/2021 15:58, Jiayong Sun wrote:
Hi Erick,
The nodes have 4 SSDs (1 TB each, but we only use 2.4 TB of the space;
current disk usage is about 50%) in RAID0.
Based on the number of disks, we increased memtable_flush_writers to 4
instead of the default of 2.
For the following we set:
- max heap size - 31GB
- memtable_heap_space_in_mb (use default)
- memtable_offheap_space_in_mb (use default)
In the logs, we also noticed that the system.sstable_activity table has
hundreds of MB or even GB of data and is constantly flushing:
DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:932 -
Enqueuing flush of sstable_activity: 0.293KiB (0%) on-heap, 0.107KiB
(0%) off-heap
DEBUG [NonPeriodicTasks:1] <timestamp> SSTable.java:105 - Deleting
sstable:
/app/cassandra/data/system/sstable_activity-5a1ff267ace03f128563cfae6103c65e/md-103645-big
DEBUG [NativePoolCleaner] <timestamp> ColumnFamilyStore.java:1322 -
Flushing largest CFS(Keyspace='system',
ColumnFamily='sstable_activity') to free up room. Used total:
0.06/1.00, live: 0.00/0.00, flushing: 0.02/0.29, this: 0.00/0.00
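(For reference, a rough way to see how frequent these flushes are,
assuming the default debug.log location and timestamp format:)

    # total count of sstable_activity flushes in the current debug log
    grep -c 'Enqueuing flush of sstable_activity' /var/log/cassandra/debug.log
    # or bucket them by hour (field positions assume the default log pattern)
    grep 'Enqueuing flush of sstable_activity' /var/log/cassandra/debug.log | awk '{print $3, substr($4,1,2)":00"}' | sort | uniq -c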
Thanks,
Jiayong Sun
On Wednesday, August 11, 2021, 12:06:27 AM PDT, Erick Ramirez
<erick.rami...@datastax.com> wrote:
4 flush writers isn't bad since the default is 2. It doesn't make a
difference if you have fast disks (like NVMe SSDs) because only 1
thread gets used.
But if flushes are slow, the work gets distributed to 4 flush writers,
so you end up with smaller flush sizes, although it's difficult to tell
how tiny the SSTables would be without analysing the logs and overall
performance of your cluster.
Was there a specific reason you decided to bump it up to 4? I'm just
trying to get a sense of why you did it since it might provide some
clues. Out of curiosity, what do you have set for the following?
- max heap size
- memtable_heap_space_in_mb
- memtable_offheap_space_in_mb
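(If it helps, a quick way to pull the current values, assuming the
default file locations and that unset lines mean the defaults apply:)

    # effective heap size as reported by the JVM
    nodetool info | grep -i heap
    # configured memtable settings, if any are set explicitly
    grep -E 'memtable_(flush_writers|heap_space_in_mb|offheap_space_in_mb)' /etc/cassandra/cassandra.yaml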