samueldlightfoot opened a new pull request, #4814:
URL: https://github.com/apache/cassandra/pull/4814

   # CASSANDRA-21134: Direct I/O for background SSTable writes
   
   ## Summary
   
   Adds an opt-in `O_DIRECT` write path for background SSTable producers, bypassing the OS
   page cache for data that is unlikely to be re-read soon after being written. Memtable
   flush data stays buffered (it is hot and benefits from the page cache).
   
   Enabled via new YAML knobs:
   
   ```yaml
   background_write_disk_access_mode: direct    # default: standard
   direct_write_buffer_size: 256KiB              # aligned up to FS block size; auto-grows to chunk_length
   ```
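
   The sizing rule in the comment above (align up to the FS block size, grow to the chunk length) can be sketched as follows. This is a minimal illustration, not the patch's code: the class and method names are hypothetical, the order of the two steps (grow first, then align) is an assumption, and the block size is assumed to be a power of two.

   ```java
   // Hypothetical helper: names and step order are assumptions, not taken from the patch.
   final class DirectBufferSizingSketch
   {
       /** Rounds value up to the next multiple of alignment, which must be a positive power of two. */
       static int alignUp(int value, int alignment)
       {
           if (alignment <= 0 || Integer.bitCount(alignment) != 1)
               throw new IllegalArgumentException("alignment must be a positive power of two: " + alignment);
           // Do the arithmetic in long so value + alignment - 1 cannot overflow an int.
           return Math.toIntExact(((long) value + alignment - 1) & -(long) alignment);
       }

       /** Grows the configured size to at least one chunk, then aligns it to the FS block size. */
       static int effectiveBufferSize(int configuredBytes, int chunkLengthBytes, int fsBlockSizeBytes)
       {
           int grown = Math.max(configuredBytes, chunkLengthBytes);
           return alignUp(grown, fsBlockSizeBytes);
       }
   }
   ```

   With a 4096-byte block size, a configured 256KiB buffer stays at 262144 bytes for 16KiB chunks and grows to 1048576 bytes for 1MiB chunks.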
   
   The path is gated by:
   
   1. config (`background_write_disk_access_mode == direct`),
   2. table compression being enabled (required; uncompressed writers still use the buffered
      path), and
   3. an `OperationType`-keyed allowlist (`DataComponent#DIRECT_WRITE_SUPPORT`).
   
   Selection happens centrally in `DataComponent.buildWriter`; producers are unchanged.
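
   The three gates combine as a plain conjunction. A minimal sketch, assuming a simplified model: `AccessMode`, `Op`, and `useDirectWriter` are hypothetical stand-ins for the real config type, `OperationType`, and the logic inside `DataComponent.buildWriter`, and the allowlist is reduced to two entries for brevity.

   ```java
   import java.util.EnumSet;
   import java.util.Set;

   // Hypothetical model of the selection gate; not the patch's actual types.
   final class DirectWriteGateSketch
   {
       enum AccessMode { STANDARD, DIRECT }

       // Reduced stand-in for OperationType.
       enum Op { COMPACTION, STREAM, FLUSH, SCRUB }

       // Stand-in for the OperationType-keyed allowlist.
       static final Set<Op> DIRECT_WRITE_ALLOWED = EnumSet.of(Op.COMPACTION, Op.STREAM);

       /** Returns true only when all three gates pass; otherwise the buffered writer is used. */
       static boolean useDirectWriter(AccessMode mode, boolean compressionEnabled, Op op)
       {
           return mode == AccessMode.DIRECT
                  && compressionEnabled
                  && DIRECT_WRITE_ALLOWED.contains(op);
       }
   }
   ```

   Because every gate must pass, the default `standard` mode keeps all operations on the buffered path regardless of the allowlist.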
   
   ## Operations covered (DIO eligible)
   
   | OperationType          | Rationale                                  |
   | ---------------------- | ------------------------------------------ |
   | `COMPACTION`           | append-only writer                         |
   | `MAJOR_COMPACTION`     | append-only writer                         |
   | `TOMBSTONE_COMPACTION` | append-only writer                         |
   | `ANTICOMPACTION`       | append-only writer                         |
   | `GARBAGE_COLLECT`      | append-only writer                         |
   | `CLEANUP`              | append-only writer                         |
   | `UPGRADE_SSTABLES`     | append-only writer                         |
   | `WRITE`                | append-only writer                         |
   | `STREAM`               | chunked-receiver path (see ZCS exclusion)  |
   
   The allowlist is exhaustive: any new `OperationType` with `writesData == true` that is not
   classified will fail static initialization (`AssertionError`).
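
   One way such an exhaustiveness check can work is a loop in a static initializer that rejects any data-writing operation missing from the map. The sketch below is an assumption about the shape of the check, not the code in `DataComponent`: the `Op` enum is a reduced stand-in for `OperationType` (its flag values and the `READ_ONLY_EXAMPLE` constant are illustrative), and `checkExhaustive` is a hypothetical name.

   ```java
   import java.util.EnumMap;
   import java.util.Map;

   // Hypothetical sketch of a static-initialization exhaustiveness check.
   final class ExhaustiveAllowlistSketch
   {
       enum Support { SUPPORTED, UNSUPPORTED_CORRECTNESS, UNSUPPORTED_POLICY, NOT_APPLICABLE }

       // Reduced stand-in for OperationType; flag values are illustrative only.
       enum Op
       {
           COMPACTION(true), STREAM(true), FLUSH(true), SCRUB(true), READ_ONLY_EXAMPLE(false);

           final boolean writesData;

           Op(boolean writesData) { this.writesData = writesData; }
       }

       /** Throws AssertionError if any data-writing operation has no classification. */
       static void checkExhaustive(Map<Op, Support> classified)
       {
           for (Op op : Op.values())
               if (op.writesData && !classified.containsKey(op))
                   throw new AssertionError("Unclassified data-writing operation: " + op);
       }

       static final Map<Op, Support> DIRECT_WRITE_SUPPORT = new EnumMap<>(Op.class);

       static
       {
           DIRECT_WRITE_SUPPORT.put(Op.COMPACTION, Support.SUPPORTED);
           DIRECT_WRITE_SUPPORT.put(Op.STREAM, Support.SUPPORTED);
           DIRECT_WRITE_SUPPORT.put(Op.FLUSH, Support.UNSUPPORTED_POLICY);
           DIRECT_WRITE_SUPPORT.put(Op.SCRUB, Support.UNSUPPORTED_CORRECTNESS);
           // Runs once at class load, so a forgotten entry fails fast instead of silently defaulting.
           checkExhaustive(DIRECT_WRITE_SUPPORT);
       }
   }
   ```

   Adding a new data-writing constant to `Op` without a matching `put` would make class loading fail, which forces the author of a new operation to make an explicit eligibility decision.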
   
   ## Operations NOT covered
   
   | Path                      | Classification            | Reason |
   | ------------------------- | ------------------------- | ------ |
   | `FLUSH` (memtable flush)  | `UNSUPPORTED_POLICY`      | Just-flushed data is hot, so it is kept in the page cache. Memtable flushes always use buffered I/O. |
   | `SCRUB`                   | `UNSUPPORTED_CORRECTNESS` | `tryAppend` needs `mark()` / `resetAndTruncate()`, which the DIO writer cannot satisfy. |
   | Zero-Copy Streaming (ZCS) | n/a (path bypass)         | Entire-SSTable streaming does not go through `DataComponent.buildWriter`; the DIO gate never runs. |
   | Uncompressed writers      | n/a (path bypass)         | Only `CompressedSequentialWriter` has a DIO subclass in this change. |
   
   Removing an `UNSUPPORTED_CORRECTNESS` entry requires code changes; removing an
   `UNSUPPORTED_POLICY` entry is a configuration / policy decision.
   
   ## Key code
   
   - `io/DirectIoSupport.java`: eligibility enum (`SUPPORTED` / `UNSUPPORTED_CORRECTNESS` /
     `UNSUPPORTED_POLICY` / `NOT_APPLICABLE`).
   - `io/sstable/format/DataComponent.java`: central selection, allowlist, and exhaustiveness
     check; first activation per op is logged.
   - `io/compress/DirectCompressedSequentialWriter.java`: new writer; aligned buffers, no
     `mark()`/`resetAndTruncate()`.
   - `io/compress/CompressedSequentialWriter.java`: refactored to allow the DIO subclass to
     override the write chunk path; `writeChunk` contract documented and asserted.
   - `config/Config.java`, `config/DatabaseDescriptor.java`: new knobs, validation, and
     startup wiring; buffer size aligned to FS block size and auto-grown to chunk length.
   - `service/StartupChecks.java`: fails fast if `direct` is requested on a platform/FS that
     does not support `O_DIRECT`.
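
   The description does not show how `StartupChecks` detects `O_DIRECT` support. One plausible approach, sketched here with the JDK's `ExtendedOpenOption.DIRECT` (Java 10+), is to attempt a single block-aligned write to a temporary file in the data directory. The class and method names are hypothetical, and treating every failure as "unsupported" is a simplification.

   ```java
   import com.sun.nio.file.ExtendedOpenOption;
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.nio.channels.FileChannel;
   import java.nio.file.Files;
   import java.nio.file.Path;
   import java.nio.file.StandardOpenOption;

   // Hypothetical probe; the real startup check may work differently.
   final class DirectIoProbeSketch
   {
       /** Returns true if one aligned block can be written with O_DIRECT inside the given directory. */
       static boolean supportsDirectIo(Path directory)
       {
           Path probe = null;
           try
           {
               probe = Files.createTempFile(directory, "dio-probe", ".tmp");
               int blockSize = Math.toIntExact(Files.getFileStore(probe).getBlockSize());
               // Over-allocate, then slice so the buffer's start address is block-aligned.
               ByteBuffer block = ByteBuffer.allocateDirect(blockSize * 2).alignedSlice(blockSize);
               block.limit(blockSize);
               try (FileChannel channel = FileChannel.open(probe, StandardOpenOption.WRITE, ExtendedOpenOption.DIRECT))
               {
                   channel.write(block);
               }
               return true;
           }
           catch (IOException | UnsupportedOperationException | IllegalArgumentException e)
           {
               // Unsupported platform, unsupported file system, or unusable block size.
               return false;
           }
           finally
           {
               if (probe != null)
               {
                   try { Files.deleteIfExists(probe); }
                   catch (IOException ignored) { /* best-effort cleanup */ }
               }
           }
       }
   }
   ```

   A startup check built on a probe like this could refuse to start when the mode is `direct` and the probe returns false for any data directory.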
   
   ## Tests introduced
   
   - `DirectCompressedSequentialWriterTest` (unit, 818 lines): covers the DIO writer in
     isolation: chunk-boundary alignment, buffer auto-expansion to chunk length, abort/close
     paths, checksum and compression-info component correctness, error handling.
   - `DataComponentDirectWriteSelectionTest` (unit): verifies the selection matrix:
     per-OperationType eligibility, exhaustiveness assertion, compression-enabled gate,
     config-mode gate.
   - `StreamingDirectWriteTest` (in-JVM distributed): proves chunked streaming
     (`CassandraStreamReader` / `CassandraCompressedStreamReader` →
     `BigTableWriter.openDataWriter` → `OperationType.STREAM`) selects the DIO writer when
     enabled; ZCS is disabled in the test since it bypasses the selection point.
   - `DirectIoTestUtils`: shared helpers (FS block size, alignment) for the suites above.
   - `AntiCompactionTest`, `CompactionsTest`: extended to exercise the DIO path end-to-end
     for the compaction operations in the allowlist.
   - `DatabaseDescriptorTest`: validation of the new knobs (mode parsing, buffer-size
     alignment, defaults).
   
   ## Not in scope
   
   - Direct I/O on the read path.
   - Uncompressed SSTable writers.
   - ZCS streaming.
   - Memtable flush.
   
   
   patch by Sam Lightfoot; reviewed by <Reviewers> for CASSANDRA-21134
   
   
   
   See the [Cassandra Jira](https://issues.apache.org/jira/browse/CASSANDRA-21134) ticket.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


