[ https://issues.apache.org/jira/browse/CASSANDRA-21134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18057927#comment-18057927 ]

Sam Lightfoot edited comment on CASSANDRA-21134 at 3/5/26 10:59 AM:
--------------------------------------------------------------------

Preliminary test results for unthrottled cursor compaction with direct IO for 
compaction writes:
 * Memory: 12GB
 * Max Heap: 6GB
 * Hot dataset size: 1GB
 * Compaction reads: buffered

h2. Direct I/O Compaction Benchmark: Unthrottled vs Throttled (128 MiB/s)

*Setup:* 5.1-SNAPSHOT ({{CASSANDRA-21134}} + {{CASSANDRA-21147}} combined branch), 64 GB RAM, RAID1 NVMe, 12 GB cgroup, 4 GB heap. 128 GB compaction input (2 x 64 GB SSTables). 10K reads/s concurrent workload over a 1M-key hot set for 10 minutes during compaction. HdrHistogram latency capture.

*Modes tested:*
 * *Buffered* — stock Cassandra (all compaction I/O through the page cache).
 * *Write-DIO* — compaction writes via {{O_DIRECT}}, reads buffered ({{CASSANDRA-21134}} trunk patch).
 * *Both-DIO* — compaction reads + writes via {{O_DIRECT}} ({{CASSANDRA-21134}} + {{CASSANDRA-21147}} combined branch).

Baselines (no compaction) are identical across all three modes: p99 ~1.0 ms, 
mean ~0.5 ms. All degradation shown below is caused by concurrent compaction.
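Background on the two {{O_DIRECT}} modes: direct I/O requires that buffer address, file offset, and transfer length all be aligned to the device's logical block size, so a direct-I/O write path has to pad its final block. A minimal Python sketch of that constraint follows — it is purely illustrative (the 4 KiB block size is an assumption and this is not Cassandra's Java implementation, which sits behind its own file writer abstractions):

```python
import mmap
import os

BLOCK = 4096  # assumed logical block size; real code queries the device


def round_up(n: int, align: int = BLOCK) -> int:
    """Round n up to the next multiple of align, as O_DIRECT transfers require."""
    return (n + align - 1) // align * align


def direct_write(path: str, payload: bytes) -> int:
    """Illustrative only: write payload with O_DIRECT, zero-padding the last block."""
    size = round_up(len(payload))
    buf = mmap.mmap(-1, size)            # anonymous mapping: page-aligned address
    buf.write(payload)                   # rest of the mapping stays zeroed
    # O_DIRECT is Linux-specific; fall back to a plain write elsewhere.
    flags = os.O_WRONLY | os.O_CREAT | getattr(os, "O_DIRECT", 0)
    fd = os.open(path, flags, 0o644)
    try:
        return os.write(fd, buf)
    finally:
        os.close(fd)
        buf.close()
```

The padding is why the ticket calls out flush-buffer sizing (256 KB vs 1 MB): every direct write moves whole aligned blocks, never partial ones.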
----
h3. 1. Unthrottled Compaction

Compaction runs at full device speed (~246–286 MiB/s). This is the worst-case 
scenario for page cache pollution.
||Metric||Buffered||Write-DIO||Both-DIO||Write-DIO vs Buffered||Both-DIO vs Buffered||
|p50 read latency|0.48 ms|0.48 ms|0.49 ms|—|—|
|*p99 read latency*|*16.2 ms*|*1.9 ms*|*1.90 ms*|*8.5x lower*|*8.5x lower*|
|*p99.9 read latency*|*100.6 ms*|*7.9 ms*|*8.47 ms*|*12.7x lower*|*11.9x lower*|
|Mean read latency|1.29 ms|0.61 ms|0.61 ms|2.1x lower|2.1x lower|
|*Dirty pages/s*|*42,862*|*1,038*|*874*|*41x reduction*|*49x reduction*|
|*Stall time (us/s)*|*31,599*|*21,534*|*14,170*|*-32%*|*-55%*|
|Cache hit ratio|16.1%|17.6%|34.5%|+1.5 pp|+18.4 pp (2.1x)|
|Active file cache|471 MB|488 MB|433 MB|+4%|-8%|
|*NVMe write latency (w_await)*|*9.14 ms*|*1.04 ms*|*1.08 ms*|*8.8x lower*|*8.5x lower*|
|NVMe read latency (r_await)|0.386 ms|0.210 ms|0.209 ms|-46%|-46%|
|NVMe queue depth (aqu-sz)|11.90|1.52|3.00|7.8x shallower|4.0x shallower|
|Compaction throughput|246 MiB/s|286 MiB/s|~283 MiB/s|+16%|+15%|

Both-DIO matches Write-DIO at p99 but delivers additional kernel-health improvements: 55% less reclaim stall time and a 2.1x higher page cache hit ratio. The p99 equivalence is explained by the deeper device queue (1.52 → 3.00): compaction reads now bypass the page cache and hit the device directly, and that extra device contention offsets the stall reduction.
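The "x lower" and "x reduction" columns are plain ratios of the buffered and DIO readings; for example, from the unthrottled table:

```python
# Figures copied from the unthrottled table (latencies in ms, dirty pages in pages/s).
buffered_p99, both_dio_p99 = 16.2, 1.90
buffered_p999, both_dio_p999 = 100.6, 8.47
buffered_dirty, both_dio_dirty = 42_862, 874

print(round(buffered_p99 / both_dio_p99, 1))    # 8.5  -> "8.5x lower"
print(round(buffered_p999 / both_dio_p999, 1))  # 11.9 -> "11.9x lower"
print(round(buffered_dirty / both_dio_dirty))   # 49   -> "49x reduction"
```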
----
h3. 2. Throttled Compaction (128 MiB/s)

Compaction throughput capped at 128 MiB/s (2x the production default of 64 
MiB/s). This demonstrates that Direct I/O benefits hold under realistic 
compaction rate limits.
||Metric||Buffered||Write-DIO||Both-DIO||Write-DIO vs Buffered||Both-DIO vs Buffered||
|p50 read latency|0.49 ms|0.50 ms|0.46 ms|—|—|
|p90 read latency|0.90 ms|0.90 ms|0.84 ms|—|—|
|*p99 read latency*|*9.63 ms*|*1.84 ms*|*1.78 ms*|*5.2x lower*|*5.4x lower*|
|*p99.9 read latency*|*51.64 ms*|*4.92 ms*|*5.01 ms*|*10.5x lower*|*10.3x lower*|
|*p99.99 read latency*|*98.57 ms*|*13.37 ms*|*12.65 ms*|*7.4x lower*|*7.8x lower*|
|Mean read latency|0.93 ms|0.59 ms|0.55 ms|1.6x lower|1.7x lower|
|Max read latency|245.37 ms|30.28 ms|36.18 ms|8.1x lower|6.8x lower|
|StdDev|3.26 ms|0.41 ms|0.40 ms|8.0x lower|8.2x lower|
|Total reads|5,730,270|5,730,158|5,730,246| | |
|*Dirty pages/s*|*25,211*|*544*|*556*|*46x reduction*|*45x reduction*|
|*Stall time (us/s)*|*32,645*|*24,790*|*13,775*|*-24%*|*-58%*|
|Cache hit ratio|27.2%|29.8%|36.0%|+2.6 pp|+8.8 pp (+32% relative)|
|Cache misses/s|66,277|60,483|48,969|-9%|-26%|
|Active file cache|452 MB|478 MB|641 MB|+6%|+42%|
|*NVMe write latency (w_await)*|*5.14 ms*|*0.46 ms*|*0.45 ms*|*11.2x lower*|*11.4x lower*|
|NVMe read latency (r_await)|0.258 ms|0.169 ms|0.171 ms|-34%|-34%|
|NVMe queue depth (aqu-sz)|5.81|0.91|0.97|6.4x shallower|6.0x shallower|
|Device utilisation|99.2%|98.5%|99.7%|~same|~same|
|Compaction throughput|128 MiB/s|128 MiB/s|128 MiB/s|same|same|

Under throttling, all three modes hit 128.0 MiB/s exactly. The DIO advantage is 
purely in read latency and kernel health — not throughput.
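For reference, the 128 MiB/s cap corresponds to the standard throttle knob in {{cassandra.yaml}}; the disk-access-mode setting is only proposed by this ticket, so its name and values below are tentative:

```
# cassandra.yaml
compaction_throughput: 128MiB/s    # production default is 64MiB/s

# Proposed by CASSANDRA-21134 (name and values not yet final):
# compaction_write_disk_access_mode: direct
```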
----
h3. 3. Throttled vs Unthrottled Comparison
||Metric||Unthrottled Buffered||Throttled Buffered||Unthrottled DIO (Both)||Throttled DIO (Both)||
|p99 read latency|16.2 ms|9.63 ms|1.90 ms|1.84 ms|
|p99.9 read latency|100.6 ms|51.64 ms|8.47 ms|5.01 ms|
|Dirty pages/s|42,862|25,211|874|556|
|Stall time (us/s)|31,599|32,645|14,170|13,775|
|DIO p99 improvement|8.5x|5.2x|—|—|
|DIO p99.9 improvement|11.9x|10.3x|—|—|

Throttling reduces buffered damage (lower dirty page rate per second), but DIO 
latency is nearly unchanged (1.90 → 1.84 ms at p99) because DIO's latency floor 
is set by device contention, not memory pressure.
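Concretely, using the p99 values from the comparison table: throttling cuts the buffered p99 by roughly 41%, while the DIO p99 moves by only about 3%:

```python
# p99 read latencies (ms) from the comparison table.
unthrottled_buffered, throttled_buffered = 16.2, 9.63
unthrottled_dio, throttled_dio = 1.90, 1.84

print(round((1 - throttled_buffered / unthrottled_buffered) * 100))  # 41 (% improvement)
print(round((1 - throttled_dio / unthrottled_dio) * 100))            # 3  (% improvement)
```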


was (Author: JIRAUSER302824):
Preliminary test results for unthrottled cursor compaction with direct IO for 
compaction writes:
 * Memory: 12GB
 * Max Heap: 6GB
 * Hot dataset size: 1GB
 * Compaction reads: buffered

 

_Read Latency During Compaction_
||Percentile||Buffered #1 (ms)||Buffered #2 (ms)||Direct IO #1 (ms)||Direct IO #2 (ms)||
|p50|0.48|0.48|0.48|0.48|
|p90|1.36|1.34|0.98|0.97|
|p99|16.52|15.79|1.94|1.87|
|p99.9|141.6|59.5|8.19|7.60|
|p99.99|190.8|96.5|27.5|23.3|
|Mean|1.41|1.16|0.61|0.61|
|StdDev|7.74|3.94|0.67|0.61|
|Max|285.2|160.4|86.5|80.7|
|Total reads|5,730,193|5,730,138|5,730,235|5,730,288|

_Summary_
||Metric||Buffered (avg)||Write-DIO (avg)||Both-DIO (avg)||Write-DIO vs Buffered||Both-DIO vs Buffered||Notes||
|*p99 read latency*|16.2 ms|1.9 ms|1.90 ms|*8.5x*|*8.5x*| |
|*p99.9 read latency*|100.6 ms|7.9 ms|8.47 ms|*12.7x*|*11.9x*| |
|*Mean read latency*|1.29 ms|0.61 ms|0.61 ms|*2.1x*|*2.1x*| |
|*Stall_us/s*|31,599|21,534|14,170|*-32%*|*-55%*|Time app blocks waiting for kernel page reclaim|
|*Cache Hit Ratio*|16.1%|17.6%|34.5%|*+1.5pp*|*+18.4pp (2x)*|Both-DIO: compaction reads no longer pollute page cache|
|*Compaction throughput*|246 MiB/s|286 MiB/s|~283 MiB/s|*+16%*|*+15%*| |
|*Cache Dirty Writes/s*|42,862|1,038|874|*-97.6%*|*-98.0%*|Dirty pages/s entering page cache|
|*Device r_await*|0.386 ms|0.210 ms|0.209 ms|*-46%*|*-46%*|NVMe read latency|
|*Device w_await*|9.14 ms|1.04 ms|1.08 ms|*-89%*|*-88%*|NVMe write latency|
|*Device aqu-sz*|11.90|1.52|3.00|*-87%*|*-75%*|I/O queue depth|
|p50 read latency|0.48 ms|0.48 ms|0.49 ms|—|—|Median unaffected — damage is tail-only|
|Active_File|471 MB|488 MB|433 MB|+4%|-8%|Page cache pages actively referenced by reads|

*Write-DIO* = compaction writes via O_DIRECT, reads buffered (trunk).
*Both-DIO* = compaction reads + writes via O_DIRECT 
({{CASSANDRA-21134-21147-combined}} branch).

Both-DIO matches write-only DIO at p99 but delivers additional kernel health 
improvements: 55% less reclaim stall time and 2x page cache hit ratio. The p99 
equivalence is explained by increased device queue depth (1.52 → 3.00) from 
compaction reads bypassing the page cache and hitting the device directly — 
this device contention offsets the stall reduction.

> Direct IO support for compaction writes
> ---------------------------------------
>
>                 Key: CASSANDRA-21134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21134
>             Project: Apache Cassandra
>          Issue Type: Improvement
>          Components: Local/Compaction
>            Reporter: Sam Lightfoot
>            Assignee: Sam Lightfoot
>            Priority: Normal
>             Fix For: 5.x
>
>         Attachments: image-2026-02-11-17-22-58-361.png, 
> image-2026-02-11-17-25-58-329.png
>
>
> Follow-up from the implementation for compaction reads (CASSANDRA-19987)
> Notable points
>  * Update the start-up check that impacts DIO writes (_checkKernelBug1057843_)
>  * RocksDB uses a 1 MB flush buffer. This should be configurable and performance-tested (256 KB vs 1 MB)
>  * Introduce compaction_write_disk_access_mode / background_write_disk_access_mode
>  * Support for the compressed path would be most beneficial



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
