aho135 commented on code in PR #19571:
URL: https://github.com/apache/druid/pull/19571#discussion_r3399327675
##########
docs/ingestion/kafka-ingestion.md:
##########
@@ -263,6 +263,50 @@ The following example shows a supervisor spec with idle
configuration enabled:
```
</details>
+#### Streaming partitions spec
+
+When you set `streamingPartitionsSpec.partitionDimensions` in the tuning
config, the supervisor tracks the distinct values observed for each listed
dimension during ingestion. At segment publish time, each segment is annotated
with only the values it actually ingested. The broker then uses these
annotations to skip segments at query time when the query filter doesn't
intersect the segment's declared values.
+
+This enables segment pruning for streaming-ingested data without waiting for
compaction to produce hash or range-partitioned segments. The
`partitionDimensions` should be kept in sync with the compaction config's
`partitionDimensions` for the same datasource.
+
+**Usage guidelines:**
+
+- Use only low-to-medium cardinality dimensions (for example, `tenant_id`,
`region`, `environment`). High-cardinality dimensions bloat segment metadata
with no pruning benefit.
+- Most effective when Kafka partitions are keyed by the tracked dimension (for
example, using tenant ID as the message key). Each task naturally sees a subset
of values, and segments get tight filter annotations.
+- Also works with multiple supervisors reading from separate topics into one
datasource.
+- Use a range or hashed compaction `partitionsSpec`, not the dynamic strategy:
dynamic compaction does not partition by dimension, so it cannot preserve
pruning after compaction.
+- After compaction, the streaming pruning annotations are replaced by the
compaction output's partitioning (hash or range), which provides its own
pruning.
+
+The following example configures a supervisor to track the `tenant` dimension:
+
+```json
+{
+ "type": "kafka",
+ "spec": {
+ "dataSchema": {
+ "dataSource": "multi_tenant_events",
+ "timestampSpec": {"column": "timestamp", "format": "iso"},
+ "dimensionsSpec": {"dimensions": ["tenant", "region", "event_type"]},
+ "granularitySpec": {"type": "uniform", "segmentGranularity": "HOUR",
"queryGranularity": "NONE"}
+ },
+ "ioConfig": {
+ "type": "kafka",
+ "topic": "events",
+ "consumerProperties": {"bootstrap.servers": "localhost:9092"},
+ "inputFormat": {"type": "json"},
+ "taskCount": 4,
+ "taskDuration": "PT1H"
+ },
+ "tuningConfig": {
+ "type": "kafka",
+ "streamingPartitionsSpec": {"partitionDimensions": ["tenant"]}
Review Comment:
Nice! It makes sense to have this in `tuningConfig`. In the future we can also
add cardinality guardrails to `streamingPartitionsSpec`.
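
For anyone reading along, here is a minimal sketch of the pruning behavior the new section describes: each segment carries the distinct values it ingested for a tracked dimension, and a query is routed only to segments whose declared values intersect the filter. This is an illustration under assumptions, not Druid's implementation. The `Segment` class, the `prune` function, and the rule that an unannotated segment is never skipped are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Segment:
    """Hypothetical stand-in for a published segment and its annotation."""
    segment_id: str
    # Dimension name -> distinct values this segment actually ingested.
    declared_values: dict = field(default_factory=dict)


def prune(segments, dimension, wanted):
    """Return the segments that could hold rows where `dimension` is in `wanted`."""
    kept = []
    for seg in segments:
        declared = seg.declared_values.get(dimension)
        # Assumption: a segment with no annotation for this dimension
        # cannot be ruled out, so it is always kept.
        if declared is None or declared & wanted:
            kept.append(seg)
    return kept


segments = [
    Segment("seg-a", {"tenant": frozenset({"acme", "globex"})}),
    Segment("seg-b", {"tenant": frozenset({"initech"})}),
    Segment("seg-c"),  # no annotation, e.g. published before tracking was enabled
]

# A filter on tenant = 'acme' skips seg-b, whose declared values do not intersect.
matching = prune(segments, "tenant", {"acme"})
print([s.segment_id for s in matching])
```

The sketch also hints at why high-cardinality dimensions are a poor fit: the per-segment value sets grow large while most filters still intersect them.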
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]