samueldlightfoot opened a new pull request, #4814:
URL: https://github.com/apache/cassandra/pull/4814
# CASSANDRA-21134: Direct I/O for background SSTable writes
## Summary
Adds an opt-in `O_DIRECT` write path for background SSTable producers, bypassing the OS page cache for data that is unlikely to be re-read soon after being written. Memtable flush data stays buffered (it is hot and benefits from the page cache).
Enabled via new YAML knobs:
```yaml
background_write_disk_access_mode: direct # default: standard
direct_write_buffer_size: 256KiB # aligned up to FS block size; auto-grows to chunk_length
```
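The buffer-size comment above implies a small piece of arithmetic. The following is a minimal sketch of one way to compute it, not the code in this PR: the class name, method names, and the order of the two steps (grow to chunk length first, then align up) are assumptions made for illustration.

```java
// Hypothetical sketch: names and step order are assumptions, not taken from the PR.
final class DirectBufferSizingSketch {

    /** Rounds value up to the next multiple of alignment. */
    static int alignUp(int value, int alignment) {
        if (alignment <= 0) {
            throw new IllegalArgumentException("alignment must be positive");
        }
        // Use long arithmetic so the intermediate sum cannot overflow an int.
        long aligned = ((long) value + alignment - 1) / alignment * alignment;
        return Math.toIntExact(aligned);
    }

    /**
     * Grows the configured size to at least one compression chunk,
     * then aligns the result up to the file-system block size.
     */
    static int effectiveBufferSize(int configuredBytes, int fsBlockSize, int chunkLength) {
        int atLeastOneChunk = Math.max(configuredBytes, chunkLength);
        return alignUp(atLeastOneChunk, fsBlockSize);
    }

    public static void main(String[] args) {
        // 256 KiB configured, 4 KiB blocks, 16 KiB chunks: already aligned and large enough.
        System.out.println(effectiveBufferSize(256 * 1024, 4096, 16 * 1024));
        // 256 KiB configured, 4 KiB blocks, 1 MiB chunks: grows to the chunk length.
        System.out.println(effectiveBufferSize(256 * 1024, 4096, 1024 * 1024));
    }
}
```

Taking the maximum before aligning keeps the final value a multiple of the block size even when the chunk length is not.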
The path is gated by:
1. config (`background_write_disk_access_mode == direct`),
2. table compression being enabled (required: uncompressed writers still use the buffered path), and
3. an `OperationType`-keyed allowlist (`DataComponent#DIRECT_WRITE_SUPPORT`).
Selection happens centrally in `DataComponent.buildWriter`; producers are unchanged.
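The three gates read as a single conjunction. The following is a minimal sketch under stated assumptions, not the selection code in this PR: `DirectWriteGateSketch`, its nested enums, and `useDirectWriter` are hypothetical names, and the allowlist here contains only two illustrative operations.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch of the three-gate check; all names are illustrative.
final class DirectWriteGateSketch {

    enum DiskAccessMode { STANDARD, DIRECT }

    // A small stand-in for Cassandra's OperationType.
    enum Op { COMPACTION, STREAM, FLUSH, SCRUB }

    // Gate 3: only allowlisted operations may use the direct writer.
    private static final Set<Op> ALLOWED = EnumSet.of(Op.COMPACTION, Op.STREAM);

    static boolean useDirectWriter(DiskAccessMode mode, boolean compressionEnabled, Op op) {
        return mode == DiskAccessMode.DIRECT   // gate 1: config
            && compressionEnabled              // gate 2: table compression
            && ALLOWED.contains(op);           // gate 3: operation allowlist
    }

    public static void main(String[] args) {
        System.out.println(useDirectWriter(DiskAccessMode.DIRECT, true, Op.COMPACTION));
        System.out.println(useDirectWriter(DiskAccessMode.DIRECT, true, Op.FLUSH));
        System.out.println(useDirectWriter(DiskAccessMode.STANDARD, true, Op.COMPACTION));
    }
}
```

Because the decision is a pure function of config, table settings, and operation type, a caller that asks for a writer does not need to know which path was chosen.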
## Operations covered (DIO eligible)
| OperationType | Rationale |
| --------------------- | ------------------------------------------ |
| `COMPACTION` | append-only writer |
| `MAJOR_COMPACTION` | " |
| `TOMBSTONE_COMPACTION`| " |
| `ANTICOMPACTION` | " |
| `GARBAGE_COLLECT` | " |
| `CLEANUP` | " |
| `UPGRADE_SSTABLES` | " |
| `WRITE` | " |
| `STREAM` | chunked-receiver path (see ZCS exclusion) |
The allowlist is exhaustive: any new `OperationType` with `writesData == true` that is not classified will fail static initialization (`AssertionError`).
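A static-initialization exhaustiveness check of this kind could look roughly like the sketch below. It is an illustration only: the `Op` enum is a reduced stand-in for `OperationType`, its `writesData` flag values are made up for the example, and `checkExhaustive` is a hypothetical helper rather than a method from this PR.

```java
import java.util.EnumMap;
import java.util.Map;

// Hypothetical sketch; enum members and flag values are illustrative only.
final class ExhaustiveAllowlistSketch {

    enum Support { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY, NOT_APPLICABLE }

    enum Op {
        COMPACTION(true), STREAM(true), FLUSH(true), SCRUB(true), VERIFY(false);

        final boolean writesData;

        Op(boolean writesData) { this.writesData = writesData; }
    }

    /** Throws if any data-writing operation has no classification. */
    static void checkExhaustive(Map<Op, Support> table) {
        for (Op op : Op.values()) {
            if (op.writesData && !table.containsKey(op)) {
                throw new AssertionError("Unclassified data-writing operation: " + op);
            }
        }
    }

    static final Map<Op, Support> DIRECT_WRITE_SUPPORT = new EnumMap<>(Op.class);

    static {
        DIRECT_WRITE_SUPPORT.put(Op.COMPACTION, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.STREAM, Support.SUPPORTED);
        DIRECT_WRITE_SUPPORT.put(Op.FLUSH, Support.UNSUPPORTED_POLICY);
        DIRECT_WRITE_SUPPORT.put(Op.SCRUB, Support.UNSUPPORTED_CORRECTNESS);
        // Runs once at class load: adding a new writesData op without
        // classifying it above makes class initialization fail.
        checkExhaustive(DIRECT_WRITE_SUPPORT);
    }

    public static void main(String[] args) {
        System.out.println(DIRECT_WRITE_SUPPORT);
    }
}
```

Failing at class load turns a forgotten classification into an immediate, visible error instead of a silent fallback to one path or the other.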
## Operations NOT covered
| Path | Classification | Reason |
| ----------------------------- | ------------------------- | ------ |
| `FLUSH` (memtable flush) | `UNSUPPORTED_POLICY` | Just-flushed data is hot; keep it in the page cache. Memtable flushes always use buffered I/O. |
| `SCRUB` | `UNSUPPORTED_CORRECTNESS` | `tryAppend` needs `mark()` / `resetAndTruncate()`, which the DIO writer cannot satisfy. |
| Zero-Copy Streaming (ZCS) | n/a (path bypass) | Entire-SSTable streaming does not go through `DataComponent.buildWriter`; the DIO gate never runs. |
| Uncompressed writers | n/a (path bypass) | Only `CompressedSequentialWriter` has a DIO subclass in this change. |
Removing an `UNSUPPORTED_CORRECTNESS` entry requires code changes; removing an `UNSUPPORTED_POLICY` entry is a configuration / policy decision.
## Key code
- `io/DirectIoSupport.java`: eligibility enum (`SUPPORTED` / `UNSUPPORTED_CORRECTNESS` / `UNSUPPORTED_POLICY` / `NOT_APPLICABLE`).
- `io/sstable/format/DataComponent.java`: central selection + allowlist + exhaustiveness check; first activation per op is logged.
- `io/compress/DirectCompressedSequentialWriter.java`: new writer; aligned buffers, no `mark()`/`resetAndTruncate()`.
- `io/compress/CompressedSequentialWriter.java`: refactored to allow the DIO subclass to override the write chunk path; `writeChunk` contract documented and asserted.
- `config/Config.java`, `config/DatabaseDescriptor.java`: new knobs, validation, and startup wiring; buffer size aligned to FS block size and auto-grown to chunk length.
- `service/StartupChecks.java`: fails fast if `direct` is requested on a platform/FS that does not support `O_DIRECT`.
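Direct I/O generally requires the buffer address and the transfer length to be multiples of the block size, which is why the new writer uses aligned buffers. As a rough illustration only, the sketch below obtains such a buffer with the JDK's `ByteBuffer.alignedSlice` (Java 9+). The class and method names are hypothetical, and this is not claimed to be how `DirectCompressedSequentialWriter` allocates its buffers.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch: one way to get a block-aligned direct buffer on Java 9+.
final class AlignedBufferSketch {

    /**
     * Returns a direct buffer of exactly size bytes whose start address is a
     * multiple of blockSize. blockSize must be a power of two and size must be
     * a positive multiple of blockSize.
     */
    static ByteBuffer allocateAligned(int size, int blockSize) {
        if (blockSize <= 0 || Integer.bitCount(blockSize) != 1) {
            throw new IllegalArgumentException("blockSize must be a power of two");
        }
        if (size <= 0 || size % blockSize != 0) {
            throw new IllegalArgumentException("size must be a positive multiple of blockSize");
        }
        // Over-allocate so an aligned window of the requested size always fits.
        ByteBuffer raw = ByteBuffer.allocateDirect(size + blockSize - 1);
        ByteBuffer aligned = raw.alignedSlice(blockSize);
        // Trim the aligned window down to the requested size.
        aligned.limit(size);
        return aligned.slice();
    }

    public static void main(String[] args) {
        ByteBuffer buffer = allocateAligned(256 * 1024, 4096);
        System.out.println("capacity = " + buffer.capacity());
        System.out.println("offset from alignment = " + buffer.alignmentOffset(0, 4096));
    }
}
```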
## Tests introduced
- `DirectCompressedSequentialWriterTest` (unit, 818 lines): covers the DIO writer in isolation (chunk-boundary alignment, buffer auto-expansion to chunk length, abort/close paths, checksum + compression-info component correctness, error handling).
- `DataComponentDirectWriteSelectionTest` (unit): verifies the selection matrix (per-OperationType eligibility, exhaustiveness assertion, compression-enabled gate, config-mode gate).
- `StreamingDirectWriteTest` (in-JVM distributed): proves that chunked streaming (`CassandraStreamReader` / `CassandraCompressedStreamReader` → `BigTableWriter.openDataWriter` → `OperationType.STREAM`) selects the DIO writer when enabled; ZCS is disabled in the test since it bypasses the selection point.
- `DirectIoTestUtils`: shared helpers (FS block size, alignment) for the suites above.
- `AntiCompactionTest`, `CompactionsTest`: extended to exercise the DIO path end-to-end for the compaction operations in the allowlist.
- `DatabaseDescriptorTest`: validation of the new knobs (mode parsing, buffer-size alignment, defaults).
## Not in scope
- Direct I/O on the read path.
- Uncompressed SSTable writers.
- ZCS streaming.
- Memtable flush.
patch by Sam Lightfoot; reviewed by <Reviewers> for CASSANDRA-21134
The [Cassandra Jira](https://issues.apache.org/jira/browse/CASSANDRA-21134)