Hi Amit,
Those are some brilliant tests you have done there. They show that
compaction throughput not only can be a bottleneck on the speed of
insert operations, but can also stress the JVM garbage collector, and
the resulting GC pressure can then cause other operations, such as
inserts, to fail.
Your last statement is correct. The commit log change can be beneficial
for atypical workloads where a large volume of data is inserted and
then expires soon afterwards, for example when using the
TimeWindowCompactionStrategy with a short TTL. But I must point out that
this kind of atypical usage is often an anti-pattern in Cassandra, as
Cassandra is a database, not a queue or cache system.
This, however, is not to say the commit log change should not be
introduced. As others have pointed out, it comes down to a balancing act
between cost and benefit, and it will depend on the code complexity
and the effect the change has on typical workloads, such as CPU and JVM
heap usage. After all, we should prioritise the performance and
reliability of typical usage before optimising for atypical use cases.
Best,
Bowen
On 26/07/2022 12:41, Pawar, Amit wrote:
Hi Bowen,
Thanks for the reply; it helped to identify the failure point.
I tested compaction throughput with different values, and with
1024 MB/s the threads active in compaction report the
"java.lang.OutOfMemoryError: Map failed" error earlier than with other
values. This suggests that with lower throughput such issues will still
come up, not immediately but over days or weeks. Test results are given
below.
+------------+------------------------------+--------------------------+-----------------+
| Records    | Compaction throughput (MB/s) | 5 largest files (GB)     | Disk usage (GB) |
+------------+------------------------------+--------------------------+-----------------+
| 2000000000 | 8                            | Not collected            | 500             |
| 2000000000 | 16                           | Not collected            | 500             |
| 900000000  | 64                           | 3.5, 3.5, 3.5, 3.5, 3.5  | 273             |
| 900000000  | 128                          | 3.5, 3.9, 4.9, 8.0, 15   | 287             |
| 900000000  | 256                          | 11, 11, 12, 16, 20       | 359             |
| 900000000  | 512                          | 14, 19, 23, 27, 28       | 469             |
| 900000000  | 1024                         | 14, 18, 23, 27, 28       | 458             |
| 900000000  | 0 (unthrottled)              | 6.9, 6.9, 7.0, 28, 28    | 223             |
+------------+------------------------------+--------------------------+-----------------+
Issues observed with increasing compaction throughput:
1. Out of memory errors
2. Scores reduce as throughput is increased
3. File sizes grow as throughput is increased
4. Insert failures are noticed
After this testing, I feel that this change is beneficial for
workloads where data is not kept on the nodes for too long. With
lower throughput a large system can ingest more data. Does that make sense?
Thanks,
Amit
*From:* Bowen Song via dev <dev@cassandra.apache.org>
*Sent:* Friday, July 22, 2022 4:37 PM
*To:* dev@cassandra.apache.org
*Subject:* Re: [DISCUSS] Improve Commitlog write path
Hi Amit,
The compaction bottleneck is not an instantly visible limitation. In
effect it limits the total size of writes over a fairly long period of
time, because compaction is asynchronous and can be queued. That means
if compaction can't keep up with the writes, they will be queued, and
Cassandra remains fully functional until it hits the "too many open
files" error or the filesystem runs out of free inodes. This can
happen over many days or even weeks.
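For what it's worth, the queue building up is visible long before that
breaking point, for example via:

    # shows pending compaction tasks and the compactions currently running
    nodetool compactionstats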
For the purpose of benchmarking, you may prefer to measure the max
concurrent compaction throughput, instead of actually waiting for that
breaking moment. The max write throughput is a fraction of the max
concurrent compaction throughput, usually by a factor of 5 or more for
a non-trivially sized table, depending on the table size in bytes.
Search for "STCS write amplification" to understand why that's the
case. That means if you've measured a max concurrent compaction
throughput of 1 GB/s, your average max insertion speed over a period of
time is probably less than 200 MB/s.
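Spelling the arithmetic out (the factor of 5 is the assumed write
amplification from above):

    max sustained insert rate ~ max concurrent compaction throughput / write amplification
                              ~ 1 GB/s / 5 ~ 200 MB/s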
If you really decide to test the compaction bottleneck in action, it's
better to measure the table size in bytes on disk, rather than the
number of records. That's because not only the record count, but also
the size of partitions and the compression ratio, all have a meaningful
effect on the compaction workload. It's also worth mentioning that if
you are using the STCS strategy, which is more suitable for write-heavy
workloads, you may want to keep an eye on the SSTable data file size
distribution. Initially the compactions may not involve any large
SSTable data file, so compaction won't be a bottleneck at all. As more
big SSTable data files are created over time, they will get involved in
compactions more and more frequently. The bottleneck will only show
up (i.e. become problematic) when there's a sufficient number of large
SSTable data files involved in multiple concurrent compactions,
occupying all available compactors and blocking (queuing) a larger
number of compactions involving smaller SSTable data files.
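A quick way to eyeball that size distribution (the data directory path
below is illustrative, adjust it for your installation):

    # SSTable data files sorted by size, largest first
    ls -lhS /var/lib/cassandra/data/<keyspace>/<table>-*/*-Data.db

    # per-table SSTable count, space used and compression ratio
    nodetool tablestats <keyspace>.<table>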
Regards,
Bowen
On 22/07/2022 11:19, Pawar, Amit wrote:
Thank you, Bowen, for your reply. It took me some time to respond due
to a testing issue.
I tested the multi-threaded feature again with the number of records
ranging from 260 million to 2 billion, and an improvement is still
seen: the score is around 80% of the Ramdisk score. It is still
possible that compaction will become the new bottleneck, and that
could be a new opportunity to fix it. I am a newbie here, and it is
possible that I failed to understand your suggestion completely. At
least with this testing, the multi-threading benefit is reflected in
the score.
Do you think multi-threading is good to have now? Otherwise, please
suggest whether I need to test further.
Thanks,
Amit
*From:* Bowen Song via dev <dev@cassandra.apache.org>
*Sent:* Wednesday, July 20, 2022 4:13 PM
*To:* dev@cassandra.apache.org
*Subject:* Re: [DISCUSS] Improve Commitlog write path
From my past experience, the bottleneck for an insert-heavy workload
is likely to be compaction, not the commit log. You may initially see
the commit log as the bottleneck when the table size is relatively
small, but as the table size increases, compaction will likely
take its place and become the new bottleneck.
On 20/07/2022 11:11, Pawar, Amit wrote:
Hi all,
(My previous mail did not appear in the mailing list, so I am
resending it after 2 days.)
I am Amit, working at AMD Bangalore, India. I am new to Cassandra and
need to do Cassandra testing on large core-count systems. Ideally this
should be tested on multi-node Cassandra, but I started with
single-node testing to understand how Cassandra scales with increasing
core counts.
Test details:
Operation: Insert > 90% (insert heavy)
Operation: Scan < 10%
Cassandra: 3.11.10 and trunk
Benchmark: TPCx-IOT (similar to YCSB)
Results show that scaling is almost linear up to 16 cores but poor
beyond that. The following common settings helped to get better
scores (an illustrative config excerpt follows the list):
1. Memtable heap allocation: offheap_objects
2. memtable_flush_writers > 4
3. Java heap: 8-32GB with survivor ratio tuning
4. Separate storage space for Commitlog and Data.
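For reference, these settings map roughly onto the following (paths
and values are illustrative; the heap settings live in jvm.options,
not cassandra.yaml):

    # cassandra.yaml (excerpt)
    memtable_allocation_type: offheap_objects
    memtable_flush_writers: 8
    commitlog_directory: /mnt/commitlog-disk/commitlog
    data_file_directories:
        - /mnt/data-disk/data

    # jvm.options (excerpt)
    -Xms16G
    -Xmx16G
    -XX:SurvivorRatio=4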
Many online blogs suggest adding a new Cassandra node when a cluster
is unable to take high write rates. But on large systems, high write
rates should be easily handled thanks to the many cores. The need was
to improve scaling with more cores, so this suggestion didn't help.
After many rounds of testing it was observed that the current
implementation uses a single thread for the Commitlog syncing
activity. Commitlog files are mapped using the mmap system call and
changes are written out with msync (a minimal sketch of this pattern
follows the list below). Periodic syncing observed with the JVisualVM
tool shows:
1. The thread is not 100% busy with Ramdisk used for Commitlog
storage, and scaling improved on large systems. Ramdisk scores are
> 2x the NVMe score.
2. The thread becomes 100% busy with NVMe used for the Commitlog, and
the score does not improve much beyond 16 cores.
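For anyone unfamiliar with that write path, here is a minimal,
self-contained sketch of the mmap-and-msync pattern described above,
using plain java.nio (illustrative only, not the actual Cassandra
code; the file name and region size are made up):

    import java.io.IOException;
    import java.nio.MappedByteBuffer;
    import java.nio.channels.FileChannel;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.nio.file.StandardOpenOption;

    public class MmapSyncDemo {
        public static void main(String[] args) throws IOException {
            Path segment = Paths.get("commitlog-demo.log"); // hypothetical segment file
            try (FileChannel channel = FileChannel.open(segment,
                    StandardOpenOption.CREATE, StandardOpenOption.READ,
                    StandardOpenOption.WRITE)) {
                // map a 32 MiB region of the file; this is the mmap(2) call
                MappedByteBuffer buffer =
                        channel.map(FileChannel.MapMode.READ_WRITE, 0, 32 << 20);
                // writes go straight into the mapped page-cache pages
                buffer.put("mutation bytes...".getBytes(StandardCharsets.UTF_8));
                // force() flushes the dirty pages to disk; this is the msync(2) call
                buffer.force();
            }
        }
    }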
The Linux kernel uses 4K pages for memory mapped via the mmap system
call. So, to understand this further, disk I/O testing was done using
the FIO tool (an illustrative invocation follows the list below), and
the results show:
1. NVMe 4K random R/W throughput is very low with a single thread, and
it improves with multiple threads.
2. Ramdisk 4K random R/W throughput is good even with a single thread,
and is also better with multiple threads.
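An illustrative FIO invocation for the 4K random R/W test (flags can
of course be tuned; --numjobs=1 versus a higher value gives the
single- versus multi-threaded comparison):

    fio --name=4k-randrw --filename=/path/to/testfile --size=4G \
        --rw=randrw --bs=4k --ioengine=libaio --direct=1 --iodepth=16 \
        --numjobs=1 --runtime=60 --time_based --group_reporting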
Based on the FIO test results, the following two ideas were tested for
Commitlog files with the Cassandra-3.11.10 sources.
1. Enable the Direct I/O feature for Commitlog files (similar to
[CASSANDRA-14466] Enable Direct I/O - ASF JIRA
<https://issues.apache.org/jira/browse/CASSANDRA-14466>)
2. Enable multi-threaded syncing for Commitlog files.
The first one needs retesting. Interestingly, the second one helped to
improve the score with the NVMe disk: the NVMe disk configuration
scores within 80-90% of Ramdisk, and 2 times the single-threaded
implementation. Multithreading was enabled by adding a new thread pool
in the "AbstractCommitLogSegmentManager" class, and the syncing thread
was changed into a manager thread for this new thread pool, to take
care of synchronization (a hypothetical sketch of the idea follows
below). This has only been tested with Cassandra-3.11.10 and needs
complete testing, but the change is working in my test environment. I
tried these few experiments so that I could discuss them here and seek
your valuable suggestions to identify the right fix for insert-heavy
workloads.
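To make the idea concrete, here is a hypothetical sketch of the
manager-plus-pool arrangement (this is not the actual patch; the class
name, pool size and API are made up for illustration):

    import java.nio.MappedByteBuffer;
    import java.util.List;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    /** Hypothetical sketch: sync several mapped commit log regions in parallel. */
    public class ParallelCommitLogSync {
        // pool size is illustrative; in practice it would be configurable
        private final ExecutorService syncPool = Executors.newFixedThreadPool(4);

        /** Called by the manager (former syncing) thread: submit one force()
         *  per dirty region, then wait until every region is on disk. */
        public void syncAll(List<MappedByteBuffer> dirtyRegions) {
            CompletableFuture<?>[] pending = dirtyRegions.stream()
                    .map(region -> CompletableFuture.runAsync(region::force, syncPool))
                    .toArray(CompletableFuture[]::new);
            CompletableFuture.allOf(pending).join(); // durability point
        }
    }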
1. Is it a good idea to convert the single-threaded syncing to a
multi-threaded implementation to improve the disk I/O?
2. Direct I/O throughput is high with a single thread, and it is a
good fit for the Commitlog case due to the file sizes. This would
improve writes on small to large systems. Would it be good to bring
this support to Commitlog files?
Please suggest.
Thanks,
Amit Pawar