aho135 commented on code in PR #19571:
URL: https://github.com/apache/druid/pull/19571#discussion_r3399327675


##########
docs/ingestion/kafka-ingestion.md:
##########
@@ -263,6 +263,50 @@ The following example shows a supervisor spec with idle configuration enabled:
 ```
 </details>
 
+#### Streaming partitions spec
+
+When you set `streamingPartitionsSpec.partitionDimensions` in the tuning config, the supervisor tracks the distinct values observed for each listed dimension during ingestion. At segment publish time, each segment is annotated with only the values it actually ingested. The broker then uses these annotations to skip segments at query time when the query filter doesn't intersect the segment's declared values.
+
+This enables segment pruning for streaming-ingested data without waiting for compaction to produce hash or range-partitioned segments. The `partitionDimensions` should be kept in sync with the compaction config's `partitionDimensions` for the same datasource.
+
+**Usage guidelines:**
+
+- Use only low-to-medium cardinality dimensions (for example, `tenant_id`, `region`, `environment`). High-cardinality dimensions bloat segment metadata with no pruning benefit.
+- Most effective when Kafka partitions are keyed by the tracked dimension (for example, using tenant ID as the message key). Each task naturally sees a subset of values, and segments get tight filter annotations.
+- Also works with multiple supervisors reading from separate topics into one datasource.
+- Use a range or hashed compaction `partitionsSpec`, not the dynamic strategy: dynamic compaction does not partition by dimension, so it cannot preserve pruning after compaction.
+- After compaction, the streaming pruning annotations are replaced by the compaction output's partitioning (hash or range), which provides its own pruning.
+
+The following example configures a supervisor to track the `tenant` dimension:
+
+```json
+{
+  "type": "kafka",
+  "spec": {
+    "dataSchema": {
+      "dataSource": "multi_tenant_events",
+      "timestampSpec": {"column": "timestamp", "format": "iso"},
+      "dimensionsSpec": {"dimensions": ["tenant", "region", "event_type"]},
+      "granularitySpec": {"type": "uniform", "segmentGranularity": "HOUR", "queryGranularity": "NONE"}
+    },
+    "ioConfig": {
+      "type": "kafka",
+      "topic": "events",
+      "consumerProperties": {"bootstrap.servers": "localhost:9092"},
+      "inputFormat": {"type": "json"},
+      "taskCount": 4,
+      "taskDuration": "PT1H"
+    },
+    "tuningConfig": {
+      "type": "kafka",
+      "streamingPartitionsSpec": {"partitionDimensions": ["tenant"]}

Review Comment:
   Nice! It makes sense to have this in `tuningConfig`, and in the future we can add cardinality guardrails to `streamingPartitionsSpec` as well.
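
   For illustration only, here is a minimal Python sketch of the pruning check the doc describes, plus one possible shape for a cardinality guardrail. It is not Druid code: `annotate`, `can_skip_segment`, and `MAX_TRACKED_VALUES` are hypothetical names, and dropping the annotation once a dimension exceeds the limit is an assumed policy, not something the PR specifies.

```python
from typing import Dict, Iterable, List, Mapping, Optional, Set

# Hypothetical model: dimension name -> distinct values seen by one segment.
Annotations = Dict[str, Set[Optional[str]]]

# Hypothetical guardrail threshold; not an actual Druid setting.
MAX_TRACKED_VALUES = 1000


def annotate(
    rows: Iterable[Mapping[str, Optional[str]]],
    partition_dimensions: List[str],
    max_values: int = MAX_TRACKED_VALUES,
) -> Annotations:
    """Collect the distinct values per tracked dimension for one segment.

    Assumed guardrail: if a dimension exceeds `max_values` distinct values,
    its annotation is dropped, so the segment is simply never pruned on it.
    """
    tracked: Annotations = {dim: set() for dim in partition_dimensions}
    overflowed: Set[str] = set()
    for row in rows:
        for dim in partition_dimensions:
            if dim in overflowed:
                continue
            tracked[dim].add(row.get(dim))  # a missing value is recorded as None
            if len(tracked[dim]) > max_values:
                overflowed.add(dim)
    return {dim: vals for dim, vals in tracked.items() if dim not in overflowed}


def can_skip_segment(
    annotations: Annotations, dimension: str, wanted: Set[Optional[str]]
) -> bool:
    """Return True when a filter `dimension IN wanted` cannot match the segment."""
    declared = annotations.get(dimension)
    if declared is None:
        # No annotation for this dimension: the segment must be scanned.
        return False
    return declared.isdisjoint(wanted)
```

   With rows for tenants `a` and `b`, a filter on tenant `c` lets the segment be skipped, while a filter on `a` or `c` does not. A dimension that overflows the limit loses its annotation and falls back to being scanned, which keeps pruning safe at the cost of doing no pruning for that segment.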



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

