ryankert01 opened a new pull request, #3215:
URL: https://github.com/apache/iggy/pull/3215
## Summary
First-pass Apache Doris sink connector for #2753. Writes Iggy messages to
Doris via the HTTP Stream Load API.
**v1 scope (intentional):** JSON payloads only, HTTP Basic auth only,
pre-created tables only (no DDL). Retry middleware, circuit breaker, and
additional auth methods are deferred to a follow-up.
## Behaviour
- **Manual 307/308 redirect following** (capped at 5) so the `Authorization`
header survives the FE → BE hop. `reqwest`'s default policy strips
`Authorization` on cross-host redirects, which would otherwise break Stream
Load entirely.
- **Deterministic per-batch label** —
`{prefix}-{stream}-{topic}-{partition}-{first_offset}-{last_offset}`. Replays
of the same batch are deduplicated by Doris within `label_keep_max_second`, so
restart-replay is safe regardless of how upstream sink-offset semantics are
resolved.
- **Status-body classification** — HTTP 200 alone is not sufficient; Doris
reports outcome in the response body's `Status` field. `Success` and `Label
Already Exists` → `Ok`; `Publish Timeout` → `CannotStoreData` (transient);
`Fail` and any unknown status → `PermanentHttpError` so the runtime DLQs the
batch instead of looping.
- **Optional Stream Load knobs** — `columns`, `where`, `max_filter_ratio`,
`batch_size`, `timeout_secs` forwarded as headers.
- **Secret hygiene** — `password` is `secrecy::SecretString`; the cached
`Authorization` header is also wrapped in `SecretString` so `Debug` derivations
never leak the base64 credential.
- **Startup validation** — `reqwest::Client` and `fe_url` are built/parsed
in `open()` and surface as `Error::InitError` / `Error::InvalidConfigValue`.
`new()` is infallible to keep the FFI boundary panic-free.
## Test plan
6 integration tests under `core/integration/tests/connectors/doris/` backed
by an `apache/doris` all-in-one testcontainer (FE HTTP 8030 + FE MySQL 9030):
- [x] `given_existent_doris_table_should_store` — happy path, ~10 rows.
- [x] `given_bulk_message_send_should_store` — 1000 rows.
- [x] `given_invalid_messages_should_skip_via_max_filter_ratio` —
`max_filter_ratio = 0.5`, mix of valid/invalid JSON, only valid rows land.
- [x] `given_replayed_label_should_dedupe` — same offsets ⇒ same label ⇒
Doris must respond `Label Already Exists`; row count must not increase.
- [x] `given_missing_target_table_should_not_create_or_corrupt` — sink to a
non-existent table; assert no auto-create and no rogue tables in the test DB.
- [x] `given_columns_config_should_apply_derived_expression` — pre-create a
table with an extra `calculated INT NOT NULL` column not present in the JSON
payload, configure `columns = "..., calculated = count + 1"`, verify the
derived value lands.
Plus 14 unit tests covering URL construction, label sanitization, status
classification, and response parsing.
## Quality checks
- `cargo build -p iggy_connector_doris_sink` ✅
- `cargo test -p iggy_connector_doris_sink --lib` ✅ (14/14)
- `cargo clippy -p iggy_connector_doris_sink -p integration --all-targets
--all-features -- -D warnings` ✅
- `cargo fmt --all -- --check` ✅
- All 6 integration tests pass on Docker for Mac (Apple Silicon).
## Notes for reviewers
- **LOC:** crate + integration tests + fixture come to ~1.9k lines.
`CONTRIBUTING.md` flags 500 LOC as the cap; bundling the crate with its tests
is intentional because the Stream Load specifics (307 redirect with auth
re-attachment, response-body `Status` parsing, deterministic-label idempotency)
aren't meaningfully reviewable without the integration coverage. Happy to split
the test files into a follow-up if maintainers prefer; the natural fault line
is `core/integration/tests/connectors/doris/` +
`core/integration/tests/connectors/fixtures/doris/`.
- The `pre-commit` `licenses-list` hook regenerated `DEPENDENCIES.md` to add
the new crate row; that's the only change in that file.
Closes #2753 *(partial — sink half of the issue; source connector remains
open)*.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]