[I] Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery with kafka input format wrapping avro_stream (druid)

via GitHub Wed, 15 Apr 2026 22:25:53 -0700


greenwich opened a new issue, #19332:
URL: https://github.com/apache/druid/issues/19332


   Please provide a detailed title (e.g. "Broker crashes when using TopN query 
with Bound filter" instead of just "Broker crashes").
   
   ### Affected Version
   31.0.0 
   
   ### Description
   
     When ingesting Kafka data using the kafka input format wrapping 
avro_stream with Schema Registry, nested Avro fields lose their type 
information during auto-flattening and become VARCHAR,   
     even with useSchemaDiscovery: true.
                                                                                
                                                                                
                                   
     Setup:                                                                     
                      
     - Single-node Druid 31.0.0 cluster (Helm chart, MiddleManager mode)
     - Kafka supervisor with kafka input format + avro_stream value format + 
Confluent Schema Registry
     - dimensionsSpec: { useSchemaDiscovery: true }                             
                      
     - No flattenSpec defined                      
   
     Avro schema (simplified):                                                  
                                                                                
                                   
     {"name": "items", "type": {"type": "array", "items": {"type": "record", 
"fields": [
       {"name": "price", "type": ["null", "double"]},                           
                                                                                
                                   
       {"name": "quantity", "type": ["null", "long"]}                           
                      
     ]}}}                                                                       
                                                                                
                                   
      
     Expected behaviour:                                                        
                                                                                
                                    
     With useSchemaDiscovery: true, auto-flattened fields like items_0_price_ 
should be detected as DOUBLE, and items_0_quantity_ as BIGINT.
                                                                                
                                                                                
                                   
     Actual behaviour:
     - Top-level Avro fields are typed correctly (e.g. createdNanos → BIGINT)   
                                                                                
                                   
     - Nested/array fields are auto-flattened with underscore notation (e.g. 
items_0_price_, items_0_quantity_, metadata_someField_) but typed as VARCHAR    
                                      
     - Values are string representations: "217000.0" instead of 217000.0
     - This affects ~110 out of 118 columns in our table — only 4 are BIGINT 
(all top-level)                                                                 
                                      
                                                                                
                      
     Verified with:                                                             
                                                                                
                                   
     SELECT COLUMN_NAME, DATA_TYPE                                              
                      
     FROM INFORMATION_SCHEMA.COLUMNS                                            
                                                                                
                                   
     WHERE TABLE_NAME = 'my_datasource'
                                                                                
                                                                                
                                   
     SELECT "items_0_price_" FROM "my_datasource" LIMIT 1                       
                      
     -- Returns: "217000.0" (string, not numeric)                               
                                                                                
                                   
      
     Impact:                                                                    
                                                                                
                                   
     All arithmetic on nested numeric fields requires explicit CAST:            
                      
     CAST("items_0_quantity_" AS DOUBLE) * CAST("items_0_price_" AS DOUBLE) AS 
notional
   
     Supervisor config (relevant parts):                                        
                                                                                
                                   
     {
       "inputFormat": {                                                         
                                                                                
                                   
         "type": "kafka",                                                       
                      
         "headerFormat": { "type": "string" },
         "keyFormat": { "type": "json" },     
         "valueFormat": {                
           "type": "avro_stream",                                               
                                                                                
                                   
           "avroBytesDecoder": { 
             "type": "schema_registry",                                         
                                                                                
                                   
             "url": "http://schema-registry:8081/";                              
                      
           }
         }                                                                      
                                                                                
                                   
       },
       "dimensionsSpec": {                                                      
                                                                                
                                   
         "useSchemaDiscovery": true                                             
                      
       }
     }
   
     Workaround:
     Define explicit flattenSpec with useFieldDiscovery: true for top-level 
fields and manually list every nested path. Impractical for wide schemas with 
100+ nested fields.
                                                                                
                                                                                
                                   
     Questions:
     1. Is this expected — does auto-flattening serialise nested values to 
strings before schema discovery runs?                                           
                                        
     2. Is there a configuration to preserve Avro types for auto-flattened 
nested fields without a manual flattenSpec?                                     
                                        
     3. Would a PR to propagate Avro schema types through the flattening step 
be welcome?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[I] Nested Avro fields flattened as VARCHAR instead of native types when using useSchemaDiscovery with kafka input format wrapping avro_stream (druid)

Reply via email to