jayakasadev opened a new issue, #3228:
URL: https://github.com/apache/iggy/issues/3228
## Description
iggy currently persists everything to local disk: per-partition append-only
`.log` segments, sparse `.index` files, an append-only state log,
per-consumer offset files, system info, and tokens. This issue proposes an
opt-in mode where an S3-compatible object store is the only persistence medium
— useful for ephemeral / scale-to-zero compute, durable-by-default archive, and
deployments where local NVMe provisioning is the bottleneck.
Local-disk mode is unchanged and remains the default; the S3 backend ships
behind a default-off cargo feature so fs-only deployments aren't affected.
## Component
Iggy server
## Proposed solution
A new `ObjectStorage` trait abstracts persistence. The rollout is incremental:
a Phase 0 feasibility spike followed by phases 1-10. Each phase is
independently mergeable, and most migrate one persistence subsystem onto the
trait.
The S3 client is **compio-native**: built on `rusty-s3` (sans-IO SigV4 +
request shaping) + `cyper` (compio HTTP client, rustls TLS). Critically, this
avoids reintroducing tokio into the data path, from which #2020 just removed it.
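
To make the seam concrete, here is a minimal sketch of what such a trait could look like. It is not the trait from the Phase 1 PRs: the method names and signatures are assumptions, the API is synchronous for brevity (the real one would presumably be async on compio), and `MemoryBackend` is a hypothetical stand-in for a test backend.

```rust
use std::collections::BTreeMap;
use std::io;

/// Hypothetical shape of the persistence seam (all names are assumptions).
pub trait ObjectStorage {
    /// Stores `data` under `key`, replacing any existing object.
    fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()>;
    /// Reads up to `len` bytes starting at `offset` (a ranged GET on S3).
    fn get_range(&self, key: &str, offset: u64, len: u64) -> io::Result<Vec<u8>>;
    /// Deletes `key`; deleting a missing key is not an error.
    fn delete(&mut self, key: &str) -> io::Result<()>;
    /// Lists keys that start with `prefix`, in lexicographic order.
    fn list(&self, prefix: &str) -> io::Result<Vec<String>>;
}

/// In-memory backend, useful only for tests.
#[derive(Default)]
pub struct MemoryBackend {
    objects: BTreeMap<String, Vec<u8>>,
}

impl ObjectStorage for MemoryBackend {
    fn put(&mut self, key: &str, data: &[u8]) -> io::Result<()> {
        self.objects.insert(key.to_owned(), data.to_vec());
        Ok(())
    }

    fn get_range(&self, key: &str, offset: u64, len: u64) -> io::Result<Vec<u8>> {
        let object = self.objects.get(key).ok_or_else(|| {
            io::Error::new(io::ErrorKind::NotFound, format!("no such object: {key}"))
        })?;
        // Clamp the requested window to the object's bounds.
        let start = (offset as usize).min(object.len());
        let end = start.saturating_add(len as usize).min(object.len());
        Ok(object[start..end].to_vec())
    }

    fn delete(&mut self, key: &str) -> io::Result<()> {
        self.objects.remove(key);
        Ok(())
    }

    fn list(&self, prefix: &str) -> io::Result<Vec<String>> {
        // BTreeMap iterates in key order, so the result is already sorted.
        Ok(self
            .objects
            .keys()
            .filter(|k| k.starts_with(prefix))
            .cloned()
            .collect())
    }
}
```

Callers written against the trait would not need to know whether the bytes live on local disk, in memory, or in a bucket, which is what lets each subsystem migrate independently.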
### Phase plan
| Phase | Scope |
|-------|-------|
| 0 | Pre-flight feasibility spike: validate rusty-s3 + cyper + compio + rustls against real AWS S3. **Done: 5/5 scenarios passed.** |
| 1 | `ObjectStorage` trait + `CompioFsStorage` + `InMemoryStorage` (test) + `S3Storage` (feature-gated) + `BufferedMultipartWriter`. Seam only, no production callers yet. ← **this issue's first deliverable** |
| 2 | State log + `FileSystemInfoStorage` + tokens onto the trait (journal + snapshot model on object backend). |
| 3 | Segment writes via multipart upload. |
| 4 | Segment reads via ranged GET + LRU byte cache (active-segment reads keep using the in-flight buffer). |
| 5 | Bootstrap, directory ops, per-partition versioned manifests for fast boot. |
| 6 | Consumer offsets repacked into one binary object per partition. |
| 7 | Retention + segment deletion. |
| 8 | Per-partition lease object (S3 conditional PUT) for split-brain safety. |
| 9 | Hardening: retries, Prometheus metrics, IAM template, perf benchmarks. |
| 10 | Documentation, sample config, release notes. |
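
Phase 4 pairs ranged GETs with an LRU byte cache. The sketch below shows one way such a cache could be bounded by bytes rather than by entry count. Everything here is an assumption for illustration: `ByteLruCache`, the `(key, offset, length)` cache key, and the linear-time recency list are not taken from the proposal.

```rust
use std::collections::{HashMap, VecDeque};

/// Cache key: (object key, range offset, range length). Hypothetical layout.
pub type RangeKey = (String, u64, u64);

/// LRU cache whose budget is measured in cached bytes.
pub struct ByteLruCache {
    capacity_bytes: usize,
    used_bytes: usize,
    entries: HashMap<RangeKey, Vec<u8>>,
    /// Front is the least recently used key, back the most recent.
    recency: VecDeque<RangeKey>,
}

impl ByteLruCache {
    pub fn new(capacity_bytes: usize) -> Self {
        Self {
            capacity_bytes,
            used_bytes: 0,
            entries: HashMap::new(),
            recency: VecDeque::new(),
        }
    }

    /// Returns the cached bytes and marks the range as most recently used.
    pub fn get(&mut self, key: &RangeKey) -> Option<&[u8]> {
        if !self.entries.contains_key(key) {
            return None;
        }
        // O(n) scan; a production cache would use an intrusive list instead.
        self.recency.retain(|k| k != key);
        self.recency.push_back(key.clone());
        self.entries.get(key).map(Vec::as_slice)
    }

    /// Inserts a range, evicting least recently used ranges until it fits.
    pub fn insert(&mut self, key: RangeKey, bytes: Vec<u8>) {
        if bytes.len() > self.capacity_bytes {
            return; // Larger than the whole budget: do not cache.
        }
        if let Some(old) = self.entries.remove(&key) {
            self.used_bytes -= old.len();
            self.recency.retain(|k| k != &key);
        }
        while self.used_bytes + bytes.len() > self.capacity_bytes {
            let Some(victim) = self.recency.pop_front() else { break };
            if let Some(evicted) = self.entries.remove(&victim) {
                self.used_bytes -= evicted.len();
            }
        }
        self.used_bytes += bytes.len();
        self.recency.push_back(key.clone());
        self.entries.insert(key, bytes);
    }

    pub fn used_bytes(&self) -> usize {
        self.used_bytes
    }
}
```

A read path built this way would consult the cache first and issue a ranged GET only on a miss, inserting the fetched bytes afterwards.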
The S3 backend ships behind a default-off `object-storage` cargo feature,
so fs-only deployments don't pull `rusty-s3` / `cyper` / `url` into their
dependency graph.
### Phase 0 spike outcome
A throwaway feasibility spike (~330 LoC) ran against real AWS S3 in an
ephemeral bucket (us-east-1; 1-day lifecycle backstop; bucket torn down on
exit). All five scenarios passed:
| Scenario | Latency |
|---|---|
| PUT 1 KiB | 108 ms |
| Range-GET 256 B | 33 ms |
| Multipart 12 MiB upload (3 parts) | 1555 ms |
| Full GET + byte-compare 12 MiB | 919 ms |
| Conditional PUT race (`If-None-Match: *`) | 62 ms; loser fenced cleanly with HTTP 412 |
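
The conditional PUT scenario is what makes lease-based fencing possible: of several writers racing to create the same key, exactly one succeeds. The sketch below models those create-if-absent semantics against an in-memory stand-in, not real S3. `FakeBucket`, `PutOutcome`, `race_for_lease`, and the lease key are hypothetical names.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};
use std::thread;

/// Result of a PUT sent with `If-None-Match: *`.
#[derive(Debug, PartialEq, Eq)]
pub enum PutOutcome {
    /// The key did not exist and was created.
    Created,
    /// The key already existed (HTTP 412 Precondition Failed on S3).
    PreconditionFailed,
}

/// In-memory stand-in for a bucket that supports conditional PUT.
#[derive(Default)]
pub struct FakeBucket {
    objects: Mutex<HashMap<String, Vec<u8>>>,
}

impl FakeBucket {
    /// Creates `key` only if it is absent; the check and the write happen
    /// under one lock, mirroring the atomicity of the conditional PUT.
    pub fn put_if_absent(&self, key: &str, body: &[u8]) -> PutOutcome {
        let mut objects = self.objects.lock().unwrap();
        if objects.contains_key(key) {
            PutOutcome::PreconditionFailed
        } else {
            objects.insert(key.to_owned(), body.to_vec());
            PutOutcome::Created
        }
    }

    pub fn get(&self, key: &str) -> Option<Vec<u8>> {
        self.objects.lock().unwrap().get(key).cloned()
    }
}

/// Spawns `contenders` threads that all try to create the same lease object.
pub fn race_for_lease(bucket: Arc<FakeBucket>, contenders: usize) -> Vec<PutOutcome> {
    let handles: Vec<_> = (0..contenders)
        .map(|id| {
            let bucket = Arc::clone(&bucket);
            thread::spawn(move || {
                bucket.put_if_absent("partition-0/lease", format!("node-{id}").as_bytes())
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}
```

Whichever thread wins, every other contender sees `PreconditionFailed` and must back off, which is the behaviour a per-partition lease relies on.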
Three correctness findings, baked into Phase 1:
1. rustls 0.23 needs an explicit `CryptoProvider::install_default()` when
cyper is configured `default-features = false`.
2. AWS ETags arrive wrapped in quotes; `rusty-s3`'s
`complete_multipart_upload` re-wraps them when serializing the XML body, so
the quotes must be stripped with `.trim_matches('"')` before each ETag is
passed back. Otherwise the request fails with `400 InvalidPart`.
3. Every multipart part except the final one must be at least 5 MiB; iggy's
typical sub-MiB flushes need a buffering layer (`BufferedMultipartWriter`) to
coalesce them into legal parts.
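
Findings 2 and 3 can be sketched with the standard library alone. This is an illustration of the two ideas, not the `BufferedMultipartWriter` from the Phase 1 PRs: `PartCoalescer` and `normalize_etag` are hypothetical names, and the sketch only slices bytes into parts without uploading anything.

```rust
/// S3's minimum size for every multipart part except the last.
pub const MIN_PART_SIZE: usize = 5 * 1024 * 1024;

/// Strips the surrounding quotes from an ETag as returned by S3, so a client
/// that adds its own quotes does not end up double-quoting it.
pub fn normalize_etag(raw: &str) -> &str {
    raw.trim_matches('"')
}

/// Buffers small flushes and emits fixed-size parts that are legal to upload.
pub struct PartCoalescer {
    part_size: usize,
    buffer: Vec<u8>,
}

impl PartCoalescer {
    pub fn new(part_size: usize) -> Self {
        assert!(part_size >= MIN_PART_SIZE, "part size below the S3 minimum");
        Self {
            part_size,
            buffer: Vec::with_capacity(part_size),
        }
    }

    /// Appends one flush and returns every full part that is now ready.
    pub fn write(&mut self, flush: &[u8]) -> Vec<Vec<u8>> {
        self.buffer.extend_from_slice(flush);
        let mut parts = Vec::new();
        while self.buffer.len() >= self.part_size {
            // Keep the first `part_size` bytes as a part, carry the rest over.
            let rest = self.buffer.split_off(self.part_size);
            parts.push(std::mem::replace(&mut self.buffer, rest));
        }
        parts
    }

    /// Returns the final part, which is allowed to be smaller than the minimum.
    pub fn finish(self) -> Option<Vec<u8>> {
        if self.buffer.is_empty() {
            None
        } else {
            Some(self.buffer)
        }
    }
}
```

With 512 KiB flushes and a 5 MiB part size, the tenth flush completes the first part and anything after that accumulates toward the next one, or becomes the short final part on `finish`.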
### Phase 1 PRs
- #3226 (Phase 1a): `ObjectStorage` trait + `CompioFsStorage` +
`InMemoryStorage` + `[system.storage]` config + bootstrap seam. No
`rusty-s3` / `cyper` deps yet. No-op for fs deployments.
- #3227 (Phase 1b, stacks on #3226): `S3Storage` +
`BufferedMultipartWriter` behind the `object-storage` cargo feature.
Includes an `IGGY_TEST_MINIO`-gated wire test.
## Alternatives considered
1. **Build on `opendal` instead of `rusty-s3` + `cyper`.** Tempting
because opendal supports more backends out-of-the-box (GCS native, Azure
native), but opendal is tokio-native. Pulling it in would require either
reintroducing tokio into iggy's data-path runtime (undoing #2020) or running
opendal behind a per-call thread bridge (channel-hop overhead per S3 call, plus
an extra runtime). Rejected.
2. **Use `rust-s3` (already a transitive dep via the iceberg sink).** Also
tokio-based via reqwest. Same problem. Left in place for the existing iceberg
connector; not used for the new path.
3. **Roll a thin compio-native HTTP client over `compio::net::TcpStream` +
`rustls` directly.** Minimal external surface, but ~300 LoC of HTTP/SigV4
plumbing to maintain in-tree. Reserved as a fallback if `cyper` later turns
out to have sharp edges; the Phase 0 spike found none.
## Open questions for maintainers
- **Issue cadence.** Happy to file separate issues per phase
(one-issue-one-PR) if you prefer that to a single umbrella. This issue is
intended to anchor design discussion for the milestone; each phase still ships
in its own PR(s).
- **Default `multipart_part_size`.** Currently 8 MiB (configurable). AWS
requires at least 5 MiB for every part except the final one. Smaller parts give
finer-grained durability but more S3 PUTs; larger parts give fewer PUTs but
larger memory buffers. 8 MiB is a starting guess; alternatives are 5, 16, or
32 MiB.
- **`ack_after_upload` default.** True is the safe default: the producer ack
waits for the part upload to succeed, so data is durable before the producer
learns of it. False is faster but loses messages if the server crashes before
the next flush; it is intended for testing only. Reasonable?
- **GCS-S3-compat support.** Phase 8 (fencing) uses S3 `If-None-Match: *`
conditional PUT. AWS / MinIO / R2 / Tigris support this; GCS-S3-compat does
**not**. OK to document GCS as not supported in Phase 8 and revisit later, or
should we adopt a different fencing mechanism (for example, a paid
coordination service)?
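
For the `multipart_part_size` question, the PUT-count side of the trade-off is simple arithmetic. The sketch below works it through for a 1 GiB segment, which is an illustrative size and not a value taken from iggy's configuration. The 10,000-part cap per upload is S3's documented limit; the function names are hypothetical.

```rust
pub const MIB: u64 = 1024 * 1024;
/// S3 allows at most 10,000 parts in one multipart upload.
pub const S3_MAX_PARTS: u64 = 10_000;

/// Number of UploadPart requests needed for one object (ceiling division).
pub fn parts_for(object_size: u64, part_size: u64) -> u64 {
    (object_size + part_size - 1) / part_size
}

/// Largest object a single multipart upload can hold at this part size.
pub fn max_object_size(part_size: u64) -> u64 {
    part_size * S3_MAX_PARTS
}

/// (part size in MiB, UploadPart requests per 1 GiB segment) for each candidate.
pub fn candidates_for_one_gib() -> Vec<(u64, u64)> {
    [5, 8, 16, 32]
        .iter()
        .map(|&mib| (mib, parts_for(1024 * MIB, mib * MIB)))
        .collect()
}
```

Under these assumptions the candidates come out to 205, 128, 64, and 32 part uploads per segment, while the writer's buffer grows from 5 MiB to 32 MiB per open upload. At 8 MiB the part cap limits a single upload to 80,000 MiB, far above the illustrative segment size.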