[
https://issues.apache.org/jira/browse/CASSANDRA-21134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057927#comment-18057927
]
Sam Lightfoot edited comment on CASSANDRA-21134 at 3/4/26 9:55 PM:
-------------------------------------------------------------------
Preliminary test results for unthrottled cursor compaction with direct IO for
compaction writes:
* Memory: 12GB
* Max Heap: 6GB
* Hot dataset size: 1GB
* Compaction reads: buffered
_Read Latency During Compaction_
||Percentile||Buffered #1 (ms)||Buffered #2 (ms)||Direct IO #1 (ms)||Direct IO #2 (ms)||
|p50|0.48|0.48|0.48|0.48|
|p90|1.36|1.34|0.98|0.97|
|p99|16.52|15.79|1.94|1.87|
|p99.9|141.6|59.5|8.19|7.60|
|p99.99|190.8|96.5|27.5|23.3|
|Mean|1.41|1.16|0.61|0.61|
|StdDev|7.74|3.94|0.67|0.61|
|Max|285.2|160.4|86.5|80.7|
|Total reads|5,730,193|5,730,138|5,730,235|5,730,288|
_Summary_
||Metric||Buffered (avg)||Write-DIO (avg)||Both-DIO (avg)||Write-DIO vs Buffered||Both-DIO vs Buffered||Notes||
|*p99 read latency*|16.2 ms|1.9 ms|1.90 ms|*8.5x*|*8.5x*| |
|*p99.9 read latency*|100.6 ms|7.9 ms|8.47 ms|*12.7x*|*11.9x*| |
|*Mean read latency*|1.29 ms|0.61 ms|0.61 ms|*2.1x*|*2.1x*| |
|*Stall_us/s*|31,599|21,534|14,170|*-32%*|*-55%*|Time the application blocks waiting on kernel page reclaim|
|*Cache Hit Ratio*|16.1%|17.6%|34.5%|*+1.5pp*|*+18.4pp (2x)*|Both-DIO: compaction reads no longer pollute the page cache|
|*Compaction throughput*|246 MiB/s|286 MiB/s|~283 MiB/s|*+16%*|*+15%*| |
|*Cache Dirty Writes/s*|42,862|1,038|874|*-97.6%*|*-98.0%*|Dirty pages/s entering the page cache|
|*Device r_await*|0.386 ms|0.210 ms|0.209 ms|*-46%*|*-46%*|NVMe read latency|
|*Device w_await*|9.14 ms|1.04 ms|1.08 ms|*-89%*|*-88%*|NVMe write latency|
|*Device aqu-sz*|11.90|1.52|3.00|*-87%*|*-75%*|I/O queue depth|
|p50 read latency|0.48 ms|0.48 ms|0.49 ms|—|—|Median unaffected; the damage is tail-only|
|Active_File|471 MB|488 MB|433 MB|+4%|-8%|Page cache pages actively referenced by reads|
*Write-DIO* = compaction writes via O_DIRECT, reads buffered (trunk).
*Both-DIO* = compaction reads and writes via O_DIRECT ({{CASSANDRA-21134-21147-combined}} branch).
Both-DIO matches write-only DIO at p99 but delivers additional kernel-health improvements: 55% less reclaim stall time and double the page cache hit ratio. The p99 equivalence is explained by the higher device queue depth (1.52 → 3.00) once compaction reads bypass the page cache and hit the device directly; this added device contention offsets the stall reduction.
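For readers unfamiliar with the O_DIRECT constraints behind these numbers: direct I/O bypasses the page cache, but in exchange the buffer address, transfer size, and file offset must all be multiples of the device's logical block size. A minimal Java NIO sketch of those mechanics (illustrative class and method names, not Cassandra's actual write path; direct opens require a filesystem that supports O_DIRECT, which tmpfs does not):

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectWriteSketch {

    // Round v up to the next multiple of alignment (alignment must be a power of two).
    static long alignUp(long v, long alignment) {
        return (v + alignment - 1) & -alignment;
    }

    // Allocate a direct buffer whose start address is aligned to blockSize,
    // as O_DIRECT requires: over-allocate by one block, then take an aligned slice.
    static ByteBuffer alignedBuffer(int capacity, int blockSize) {
        return ByteBuffer.allocateDirect(capacity + blockSize).alignedSlice(blockSize);
    }

    // Open a channel that bypasses the page cache entirely.
    static FileChannel openDirect(Path path) throws IOException {
        return FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                ExtendedOpenOption.DIRECT);
    }

    public static void main(String[] args) {
        int blockSize = 4096; // typically queried via Files.getFileStore(path).getBlockSize()
        // A 5000-byte payload must be padded to two blocks before a direct write:
        System.out.println(alignUp(5000, blockSize)); // prints 8192
        // The staged buffer itself starts on a block boundary:
        ByteBuffer buf = alignedBuffer(1 << 20, blockSize);
        System.out.println(buf.alignmentOffset(0, blockSize)); // prints 0
    }
}
```

The padding requirement is why direct writes are staged through a fixed aligned buffer rather than issued at arbitrary sizes; it is also why only whole-buffer flushes reach the device in the throughput numbers above.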
> Direct IO support for compaction writes
> ---------------------------------------
>
> Key: CASSANDRA-21134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21134
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Compaction
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 5.x
>
> Attachments: image-2026-02-11-17-22-58-361.png,
> image-2026-02-11-17-25-58-329.png
>
>
> Follow-up from the implementation for compaction reads (CASSANDRA-19987).
> Notable points:
> * Update the start-up check that impacts DIO writes (_checkKernelBug1057843_)
> * RocksDB uses a 1 MB flush buffer. This should be configurable and performance tested (256 KB vs 1 MB)
> * Introduce compaction_write_disk_access_mode / background_write_disk_access_mode
> * Support for the compressed path would be most beneficial
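The flush-buffer bullet above is essentially a batching knob: a larger staging buffer means fewer, larger device writes per SSTable. A hypothetical sketch of the accumulate-and-flush pattern under test (names invented for illustration; not Cassandra's SequentialWriter):

```java
// Illustrative accumulate-and-flush writer: bytes are staged in a fixed
// buffer and handed to the device only in whole-buffer chunks, so the
// configured buffer size directly sets the I/O size each flush submits.
public class FlushBufferSketch {
    private final byte[] buffer;
    private int position = 0;
    long flushes = 0; // counts device writes (a real writer would issue an aligned O_DIRECT write)

    FlushBufferSketch(int bufferSize) {
        this.buffer = new byte[bufferSize];
    }

    void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            int n = Math.min(data.length - offset, buffer.length - position);
            System.arraycopy(data, offset, buffer, position, n);
            position += n;
            offset += n;
            if (position == buffer.length) flush();
        }
    }

    void flush() {
        if (position == 0) return;
        flushes++; // stand-in for the actual device write
        position = 0;
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[64 * 1024]; // stream of 64 KiB application writes
        FlushBufferSketch small = new FlushBufferSketch(256 * 1024);  // 256 KiB buffer
        FlushBufferSketch large = new FlushBufferSketch(1024 * 1024); // 1 MiB buffer
        for (int i = 0; i < 64; i++) { // 4 MiB total
            small.write(chunk);
            large.write(chunk);
        }
        // 4 MiB of data: 16 device writes at 256 KiB vs 4 at 1 MiB
        System.out.println(small.flushes + " " + large.flushes); // prints "16 4"
    }
}
```

The 256 KB vs 1 MB comparison proposed above is then a trade between per-write submission overhead (fewer flushes favor 1 MB) and memory held per concurrent compaction writer (favoring 256 KB).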
--
This message was sent by Atlassian Jira
(v8.20.10#820010)