jayakasadev opened a new issue, #3228:
URL: https://github.com/apache/iggy/issues/3228

   ## Description
   
   iggy currently persists everything to local disk: per-partition append-only 
\`.log\` segments, sparse \`.index\` files, an append-only state log, 
per-consumer offset files, system info, and tokens. This issue proposes an 
opt-in mode where an S3-compatible object store is the only persistence medium 
— useful for ephemeral / scale-to-zero compute, durable-by-default archive, and 
deployments where local NVMe provisioning is the bottleneck.
   
   Local-disk mode is unchanged and remains the default; the S3 backend ships 
behind a default-off cargo feature so fs-only deployments aren't affected.
   
   ## Component
   
   Iggy server
   
   ## Proposed solution
   
   A new \`ObjectStorage\` trait abstracts persistence, with a 10-phase 
incremental rollout. Each phase migrates one persistence subsystem onto the 
trait and is independently mergeable.
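
   The issue doesn't pin down the trait's surface; as a rough illustration, 
here is a minimal synchronous sketch of the seam together with the Phase 1 
test double (the real trait would expose compio-async methods, and every name 
and signature below is an assumption, not the merged API):

```rust
use std::collections::BTreeMap;
use std::io;

// Hypothetical seam; the real trait would be async over compio.
trait ObjectStorage {
    fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()>;
    fn get(&self, key: &str) -> io::Result<Vec<u8>>;
    fn delete(&mut self, key: &str) -> io::Result<()>;
    fn list(&self, prefix: &str) -> io::Result<Vec<String>>;
}

/// Test double in the spirit of Phase 1's InMemoryStorage.
#[derive(Default)]
struct InMemoryStorage {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ObjectStorage for InMemoryStorage {
    fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()> {
        self.objects.insert(key.to_owned(), data.to_vec());
        Ok(())
    }
    fn get(&self, key: &str) -> io::Result<Vec<u8>> {
        self.objects
            .get(key)
            .cloned()
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, key.to_owned()))
    }
    fn delete(&mut self, key: &str) -> io::Result<()> {
        self.objects.remove(key);
        Ok(())
    }
    fn list(&self, prefix: &str) -> io::Result<Vec<String>> {
        Ok(self
            .objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect())
    }
}
```

   \`CompioFsStorage\` and \`S3Storage\` would implement the same trait, which 
is what makes each later phase independently mergeable.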
   
   The S3 client is **compio-native**: built on \`rusty-s3\` (sans-IO SigV4 + 
request shaping) + \`cyper\` (compio HTTP client, rustls TLS). Critically, 
this avoids reintroducing tokio, which #2020 just removed, into the data path.
   
   ### Phase plan
   
   | Phase | Scope |
   |-------|-------|
   | 0 | Pre-flight feasibility spike — validate rusty-s3 + cyper + compio + 
rustls against real AWS S3. **Done — 5/5 scenarios passed.** |
   | 1 | \`ObjectStorage\` trait + \`CompioFsStorage\` + \`InMemoryStorage\` 
(test) + \`S3Storage\` (feature-gated) + \`BufferedMultipartWriter\`. Seam only 
— no production callers yet. ← **this issue's first deliverable** |
   | 2 | State log + \`FileSystemInfoStorage\` + tokens onto the trait (journal 
+ snapshot model on object backend). |
   | 3 | Segment writes via multipart upload. |
   | 4 | Segment reads via ranged GET + LRU byte cache (active-segment reads 
keep using the in-flight buffer). |
   | 5 | Bootstrap, directory ops, per-partition versioned manifests for fast 
boot. |
   | 6 | Consumer offsets repacked into one binary object per partition. |
   | 7 | Retention + segment deletion. |
   | 8 | Per-partition lease object (S3 conditional PUT) for split-brain 
safety. |
   | 9 | Hardening — retries, Prometheus metrics, IAM template, perf 
benchmarks. |
   | 10 | Documentation, sample config, release notes. |
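
   One way to picture the Phase 6 repacking (the wire format is not specified 
in this issue; fixed-width little-endian records here are purely 
illustrative):

```rust
// Hypothetical shape for the Phase 6 per-partition offsets object:
// one fixed-width (consumer_id: u32 LE, offset: u64 LE) record per consumer,
// replacing today's per-consumer offset files.
fn pack_offsets(offsets: &[(u32, u64)]) -> Vec<u8> {
    let mut out = Vec::with_capacity(offsets.len() * 12);
    for &(consumer_id, offset) in offsets {
        out.extend_from_slice(&consumer_id.to_le_bytes());
        out.extend_from_slice(&offset.to_le_bytes());
    }
    out
}

fn unpack_offsets(bytes: &[u8]) -> Vec<(u32, u64)> {
    bytes
        .chunks_exact(12)
        .map(|rec| {
            let id = u32::from_le_bytes(rec[0..4].try_into().unwrap());
            let off = u64::from_le_bytes(rec[4..12].try_into().unwrap());
            (id, off)
        })
        .collect()
}
```

   A single binary object per partition turns N consumer-offset PUTs into 
one, which matters when every write is a billable S3 request.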
   
   The S3 backend ships behind a default-off \`object-storage\` cargo feature, 
so fs-only deployments don't pull \`rusty-s3\` / \`cyper\` / \`url\` into their 
dependency graph.
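
   In Cargo terms, the gate would look roughly like this (versions are 
placeholders and the \`dep:\` wiring is an assumption about how the feature is 
spelled, not the merged manifest):

```toml
[features]
default = []   # fs-only builds pull in none of the S3 stack
object-storage = ["dep:rusty-s3", "dep:cyper", "dep:url"]

[dependencies]
# "*" placeholders only; the real manifest would pin versions.
rusty-s3 = { version = "*", optional = true }
cyper = { version = "*", optional = true }
url = { version = "*", optional = true }
```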
   
   ### Phase 0 spike outcome
   
   A throwaway feasibility spike (~330 LoC) ran against real AWS S3 in an 
ephemeral bucket (us-east-1; 1-day lifecycle backstop; bucket torn down on 
exit). All five scenarios passed:
   
   | Scenario | Result |
   |---|---|
   | PUT 1 KiB | 108 ms |
   | Range-GET 256 B | 33 ms |
   | Multipart 12 MiB upload (3 parts) | 1555 ms |
   | Full GET + byte-compare 12 MiB | 919 ms |
   | Conditional PUT race (\`If-None-Match: *\`) | 62 ms; loser fenced cleanly 
with HTTP 412 |
   
   Three correctness findings, baked into Phase 1:
   
   1. rustls 0.23 needs an explicit \`CryptoProvider::install_default()\` when 
cyper is configured \`default-features = false\`.
   2. AWS ETags arrive wrapped in quotes; \`rusty-s3\`'s 
\`complete_multipart_upload\` re-wraps them when serializing the XML body, so 
the ETag must be \`.trim_matches('"')\`-stripped before being passed back. 
Otherwise: \`400 InvalidPart\`.
   3. Multipart minimum part size is 5 MiB except the final part; iggy's 
typical sub-MiB flushes need a buffering layer (\`BufferedMultipartWriter\`) to 
coalesce them into legal parts.
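
   Findings 2 and 3 can be made concrete with a small sketch (names and API 
are assumptions; the real \`BufferedMultipartWriter\` would upload each part 
as it is cut rather than collecting it):

```rust
/// S3's minimum part size for every part except the last.
const MIN_PART_SIZE: usize = 5 * 1024 * 1024;

/// Finding 2: strip the quotes AWS wraps around ETags before echoing them
/// back in CompleteMultipartUpload, or S3 answers 400 InvalidPart.
fn clean_etag(raw: &str) -> &str {
    raw.trim_matches('"')
}

/// Finding 3: coalesce sub-MiB flushes into legal >= 5 MiB parts.
struct BufferedMultipartWriter {
    part_size: usize,
    buf: Vec<u8>,
    parts: Vec<Vec<u8>>, // sketch only: real code uploads instead of storing
}

impl BufferedMultipartWriter {
    fn new(part_size: usize) -> Self {
        assert!(part_size >= MIN_PART_SIZE, "parts below 5 MiB are rejected");
        Self { part_size, buf: Vec::new(), parts: Vec::new() }
    }

    /// Accepts writes of any size; cuts a part once enough bytes accumulate.
    fn write(&mut self, data: &[u8]) {
        self.buf.extend_from_slice(data);
        while self.buf.len() >= self.part_size {
            let part: Vec<u8> = self.buf.drain(..self.part_size).collect();
            self.parts.push(part);
        }
    }

    /// The final part is allowed to be smaller than 5 MiB.
    fn finish(mut self) -> Vec<Vec<u8>> {
        if !self.buf.is_empty() {
            let tail = std::mem::take(&mut self.buf);
            self.parts.push(tail);
        }
        self.parts
    }
}
```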
   
   ### Phase 1 PRs
   
   - #3226 — Phase 1a: \`ObjectStorage\` trait + \`CompioFsStorage\` + 
\`InMemoryStorage\` + \`[system.storage]\` config + bootstrap seam. No 
\`rusty-s3\` / \`cyper\` deps yet. No-op for fs deployments.
   - #3227 — Phase 1b (stacks on #3226): \`S3Storage\` + 
\`BufferedMultipartWriter\` behind the \`object-storage\` cargo feature. 
Includes an \`IGGY_TEST_MINIO\`-gated wire test.
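
   For concreteness, one possible shape of the \`[system.storage]\` section 
from #3226 (every key below is a guess at the config surface, not the merged 
schema):

```toml
[system.storage]
# "fs" keeps today's behavior and stays the default;
# "s3" requires a build with the object-storage feature.
backend = "fs"

# Read only when backend = "s3"; all keys hypothetical.
[system.storage.s3]
endpoint = "https://s3.us-east-1.amazonaws.com"
region = "us-east-1"
bucket = "iggy-data"
multipart_part_size = "8 MiB"
ack_after_upload = true
```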
   
   ## Alternatives considered
   
   1. **Build on \`opendal\` instead of \`rusty-s3\` + \`cyper\`.** Tempting 
because opendal supports more backends out-of-the-box (GCS native, Azure 
native), but opendal is tokio-native. Pulling it in would require either 
reintroducing tokio into iggy's data-path runtime (undoing #2020) or running 
opendal behind a per-call thread bridge (channel-hop overhead per S3 call, plus 
an extra runtime). Rejected.
   2. **Use \`rust-s3\` (already a transitive dep via the iceberg sink).** Also 
tokio-based via reqwest. Same problem. Left in place for the existing iceberg 
connector; not used for the new path.
   3. **Roll a thin compio-native HTTP client over \`compio::net::TcpStream\` + 
\`rustls\` directly.** Minimal external surface, but ~300 LoC of HTTP/SigV4 
plumbing to maintain in-tree. Reserved as a fallback if \`cyper\` later turns 
out to have sharp edges; the Phase 0 spike found none so far.
   
   ## Open questions for maintainers
   
   - **Issue cadence.** Happy to file separate issues per phase 
(one-issue-one-PR) if you prefer that to a single umbrella. This issue is 
intended to anchor design discussion for the milestone; each phase still ships 
in its own PR(s).
   - **Default \`multipart_part_size\`.** Currently 8 MiB (configurable). The 
AWS minimum is 5 MiB for every part except the final one. Smaller parts mean 
finer durability granularity but more S3 PUTs; larger parts mean fewer PUTs 
but larger memory buffers. 8 MiB is a starting guess; alternatives under 
consideration: 5 / 16 / 32 MiB.
   - **\`ack_after_upload\` default.** \`true\` (the producer ack waits for 
part-upload success, so data is durable before the producer learns of it) is 
the safe default. \`false\` is faster but loses messages if the server crashes 
before the next flush; it is intended for testing only. Reasonable?
   - **GCS-S3-compat support.** Phase 8 (fencing) uses S3 \`If-None-Match: *\` 
conditional PUT. AWS / MinIO / R2 / Tigris support this; GCS-S3-compat does 
**not**. OK to document GCS as not-supported in Phase 8 and revisit later, or 
should we adopt a different fencing mechanism (paid coordination service)?
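
   For reference, the fencing semantics Phase 8 relies on, modeled with a 
mocked bucket (status codes mirror what AWS returned in the spike; the lease 
key is hypothetical):

```rust
use std::collections::HashSet;

/// Mock of a bucket honoring `If-None-Match: *`: the conditional PUT
/// succeeds only if the object does not already exist.
#[derive(Default)]
struct MockBucket {
    objects: HashSet<String>,
}

impl MockBucket {
    /// Simulates `PUT <key>` with `If-None-Match: *`.
    /// Returns 200 for the first writer and 412 Precondition Failed for
    /// every later one, fencing a split-brain second server instance.
    fn conditional_put(&mut self, key: &str) -> u16 {
        if self.objects.insert(key.to_owned()) {
            200
        } else {
            412
        }
    }
}
```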


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
