greenwich opened a new issue, #19332: URL: https://github.com/apache/druid/issues/19332
Please provide a detailed title (e.g. "Broker crashes when using TopN query
with Bound filter" instead of just "Broker crashes").
### Affected Version
31.0.0
### Description
When ingesting Kafka data using the kafka input format wrapping
avro_stream with Schema Registry, nested Avro fields lose their type
information during auto-flattening and become VARCHAR,
even with useSchemaDiscovery: true.
Setup:
- Single-node Druid 31.0.0 cluster (Helm chart, MiddleManager mode)
- Kafka supervisor with kafka input format + avro_stream value format +
Confluent Schema Registry
- dimensionsSpec: { useSchemaDiscovery: true }
- No flattenSpec defined
Avro schema (simplified):
{"name": "items", "type": {"type": "array", "items": {"type": "record",
"fields": [
{"name": "price", "type": ["null", "double"]},
{"name": "quantity", "type": ["null", "long"]}
]}}}
Expected behaviour:
With useSchemaDiscovery: true, auto-flattened fields like items_0_price_
should be detected as DOUBLE, and items_0_quantity_ as BIGINT.
Actual behaviour:
- Top-level Avro fields are typed correctly (e.g. createdNanos → BIGINT)
- Nested/array fields are auto-flattened with underscore notation (e.g.
items_0_price_, items_0_quantity_, metadata_someField_) but typed as VARCHAR
- Values are string representations: "217000.0" instead of 217000.0
- This affects ~110 out of 118 columns in our table — only 4 are BIGINT
(all top-level)
Verified with:
SELECT COLUMN_NAME, DATA_TYPE
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'my_datasource'
SELECT "items_0_price_" FROM "my_datasource" LIMIT 1
-- Returns: "217000.0" (string, not numeric)
Impact:
All arithmetic on nested numeric fields requires explicit CAST:
CAST("items_0_quantity_" AS DOUBLE) * CAST("items_0_price_" AS DOUBLE) AS
notional
Supervisor config (relevant parts):
{
"inputFormat": {
"type": "kafka",
"headerFormat": { "type": "string" },
"keyFormat": { "type": "json" },
"valueFormat": {
"type": "avro_stream",
"avroBytesDecoder": {
"type": "schema_registry",
"url": "http://schema-registry:8081/"
}
}
},
"dimensionsSpec": {
"useSchemaDiscovery": true
}
}
Workaround:
Define explicit flattenSpec with useFieldDiscovery: true for top-level
fields and manually list every nested path. Impractical for wide schemas with
100+ nested fields.
Questions:
1. Is this expected — does auto-flattening serialise nested values to
strings before schema discovery runs?
2. Is there a configuration to preserve Avro types for auto-flattened
nested fields without a manual flattenSpec?
3. Would a PR to propagate Avro schema types through the flattening step
be welcome?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
