[
https://issues.apache.org/jira/browse/CASSANDRA-21134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057927#comment-18057927
]
Sam Lightfoot edited comment on CASSANDRA-21134 at 3/4/26 9:55 PM:
-------------------------------------------------------------------
Preliminary test results for unthrottled cursor compaction with direct IO for
compaction writes:
* Memory: 12GB
* Max Heap: 6GB
* Hot dataset size: 1GB
* Compaction reads: buffered
_Read Latency During Compaction_
||Percentile||Buffered #1 (ms)||Buffered #2 (ms)||Direct IO #1 (ms)||Direct IO #2 (ms)||
|p50|0.48|0.48|0.48|0.48|
|p90|1.36|1.34|0.98|0.97|
|p99|16.52|15.79|1.94|1.87|
|p99.9|141.6|59.5|8.19|7.60|
|p99.99|190.8|96.5|27.5|23.3|
|Mean|1.41|1.16|0.61|0.61|
|StdDev|7.74|3.94|0.67|0.61|
|Max|285.2|160.4|86.5|80.7|
|Total reads|5,730,193|5,730,138|5,730,235|5,730,288|
_Summary_
||Metric||Buffered (avg)||Write-DIO (avg)||Both-DIO (avg)||Write-DIO vs Buffered||Both-DIO vs Buffered||Notes||
|*p99 read latency*|16.2 ms|1.9 ms|1.90 ms|*8.5x*|*8.5x*| |
|*p99.9 read latency*|100.6 ms|7.9 ms|8.47 ms|*12.7x*|*11.9x*| |
|*Mean read latency*|1.29 ms|0.61 ms|0.61 ms|*2.1x*|*2.1x*| |
|*Stall_us/s*|31,599|21,534|14,170|*-32%*|*-55%*|Time the application blocks waiting on kernel page reclaim|
|*Cache Hit Ratio*|16.1%|17.6%|34.5%|*+1.5pp*|*+18.4pp (2x)*|Both-DIO: compaction reads no longer pollute the page cache|
|*Compaction throughput*|246 MiB/s|286 MiB/s|~283 MiB/s|*+16%*|*+15%*| |
|*Cache Dirty Writes/s*|42,862|1,038|874|*-97.6%*|*-98.0%*|Dirty pages/s entering the page cache|
|*Device r_await*|0.386 ms|0.210 ms|0.209 ms|*-46%*|*-46%*|NVMe read latency|
|*Device w_await*|9.14 ms|1.04 ms|1.08 ms|*-89%*|*-88%*|NVMe write latency|
|*Device aqu-sz*|11.90|1.52|3.00|*-87%*|*-75%*|I/O queue depth|
|p50 read latency|0.48 ms|0.48 ms|0.49 ms|—|—|Median unaffected; the damage is tail-only|
|Active_File|471 MB|488 MB|433 MB|+4%|-8%|Page cache pages actively referenced by reads|
*Write-DIO* = compaction writes via O_DIRECT, reads buffered (trunk).
*Both-DIO* = compaction reads and writes via O_DIRECT ({{CASSANDRA-21134-21147-combined}} branch).
Both-DIO matches write-only DIO at p99 but delivers additional kernel-health improvements: 55% less reclaim stall time and double the page cache hit ratio. The p99 equivalence is explained by the higher device queue depth (1.52 → 3.00) once compaction reads bypass the page cache and hit the device directly; this added device contention offsets the stall reduction.
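For readers unfamiliar with the O_DIRECT constraints behind these numbers: direct I/O bypasses the page cache, but in exchange the buffer address, transfer size, and file offset must all be multiples of the device's logical block size. A minimal Java NIO sketch of those mechanics (illustrative class and method names, not Cassandra's actual write path; direct opens require a filesystem that supports O_DIRECT, which tmpfs does not):

```java
import com.sun.nio.file.ExtendedOpenOption;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class DirectWriteSketch {

    // Round v up to the next multiple of alignment (alignment must be a power of two).
    static long alignUp(long v, long alignment) {
        return (v + alignment - 1) & -alignment;
    }

    // Allocate a direct buffer whose start address is aligned to blockSize,
    // as O_DIRECT requires: over-allocate by one block, then take an aligned slice.
    static ByteBuffer alignedBuffer(int capacity, int blockSize) {
        return ByteBuffer.allocateDirect(capacity + blockSize).alignedSlice(blockSize);
    }

    // Open a channel that bypasses the page cache entirely.
    static FileChannel openDirect(Path path) throws IOException {
        return FileChannel.open(path,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                ExtendedOpenOption.DIRECT);
    }

    public static void main(String[] args) {
        int blockSize = 4096; // typically queried via Files.getFileStore(path).getBlockSize()
        // A 5000-byte payload must be padded to two blocks before a direct write:
        System.out.println(alignUp(5000, blockSize)); // prints 8192
        // The staged buffer itself starts on a block boundary:
        ByteBuffer buf = alignedBuffer(1 << 20, blockSize);
        System.out.println(buf.alignmentOffset(0, blockSize)); // prints 0
    }
}
```

The padding requirement is why direct writes are staged through a fixed aligned buffer rather than issued at arbitrary sizes; it is also why only whole-buffer flushes reach the device in the throughput numbers above.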
> Direct IO support for compaction writes
> ---------------------------------------
>
> Key: CASSANDRA-21134
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21134
> Project: Apache Cassandra
> Issue Type: Improvement
> Components: Local/Compaction
> Reporter: Sam Lightfoot
> Assignee: Sam Lightfoot
> Priority: Normal
> Fix For: 5.x
>
> Attachments: image-2026-02-11-17-22-58-361.png,
> image-2026-02-11-17-25-58-329.png
>
>
> Follow-up from the implementation for compaction reads (CASSANDRA-19987).
> Notable points:
> * Update the start-up check that impacts DIO writes (_checkKernelBug1057843_)
> * RocksDB uses a 1 MB flush buffer. This should be configurable and performance tested (256 KB vs 1 MB)
> * Introduce compaction_write_disk_access_mode / background_write_disk_access_mode
> * Support for the compressed path would be most beneficial
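The flush-buffer bullet above is essentially a batching knob: a larger staging buffer means fewer, larger device writes per SSTable. A hypothetical sketch of the accumulate-and-flush pattern under test (names invented for illustration; not Cassandra's SequentialWriter):

```java
// Illustrative accumulate-and-flush writer: bytes are staged in a fixed
// buffer and handed to the device only in whole-buffer chunks, so the
// configured buffer size directly sets the I/O size each flush submits.
public class FlushBufferSketch {
    private final byte[] buffer;
    private int position = 0;
    long flushes = 0; // counts device writes (a real writer would issue an aligned O_DIRECT write)

    FlushBufferSketch(int bufferSize) {
        this.buffer = new byte[bufferSize];
    }

    void write(byte[] data) {
        int offset = 0;
        while (offset < data.length) {
            int n = Math.min(data.length - offset, buffer.length - position);
            System.arraycopy(data, offset, buffer, position, n);
            position += n;
            offset += n;
            if (position == buffer.length) flush();
        }
    }

    void flush() {
        if (position == 0) return;
        flushes++; // stand-in for the actual device write
        position = 0;
    }

    public static void main(String[] args) {
        byte[] chunk = new byte[64 * 1024]; // stream of 64 KiB application writes
        FlushBufferSketch small = new FlushBufferSketch(256 * 1024);  // 256 KiB buffer
        FlushBufferSketch large = new FlushBufferSketch(1024 * 1024); // 1 MiB buffer
        for (int i = 0; i < 64; i++) { // 4 MiB total
            small.write(chunk);
            large.write(chunk);
        }
        // 4 MiB of data: 16 device writes at 256 KiB vs 4 at 1 MiB
        System.out.println(small.flushes + " " + large.flushes); // prints "16 4"
    }
}
```

The 256 KB vs 1 MB comparison proposed above is then a trade between per-write submission overhead (fewer flushes favor 1 MB) and memory held per concurrent compaction writer (favoring 256 KB).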
--
This message was sent by Atlassian Jira
(v8.20.10#820010)