Vicky Papavasileiou created FLINK-35620:
-------------------------------------------
Summary: Parquet writer creates wrong file for nested fields
Key: FLINK-35620
URL: https://issues.apache.org/jira/browse/FLINK-35620
Project: Flink
Issue Type: Bug
Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.19.0
Reporter: Vicky Papavasileiou

After PR [https://github.com/apache/flink/pull/24795], which added support for nested arrays, was merged, the Parquet writer produces wrong Parquet files that cannot be read correctly. Note that the readers (both Flink and Iceberg) do not throw an exception but instead return `null` for the nested field. The error is in how the field `max_definition_level` is populated for nested fields.

Consider the following Avro schema:

```
{
  "namespace": "com.test",
  "type": "record",
  "name": "RecordData",
  "fields": [
    {
      "name": "Field1",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "NestedField2",
          "fields": [
            { "name": "NestedField3", "type": "double" }
          ]
        }
      }
    }
  ]
}
```

Consider the excerpt below from a Parquet file produced by Flink for the above schema:

```
############ Column(SegmentStartTime) ############
name: NestedField3
path: Field1.list.element.NestedField3
max_definition_level: 1
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 7%)
```

The max_definition_level should be 4 but is 1.
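For illustration, here is a minimal sketch of how the expected value can be derived with parquet-column's schema API. The Parquet schema string below is an assumption (it is not taken from Flink's writer): it uses the standard 3-level LIST layout and marks each nested level, including the leaf, as optional, which is consistent with the expected max_definition_level of 4 reported above.

```
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.MessageTypeParser;

public class DefinitionLevelCheck {
    public static void main(String[] args) {
        // Hypothetical Parquet schema for the Avro record above, assuming the
        // standard 3-level list representation with optional nested levels.
        MessageType schema = MessageTypeParser.parseMessageType(
            "message RecordData {\n"
          + "  optional group Field1 (LIST) {\n"
          + "    repeated group list {\n"
          + "      optional group element {\n"
          + "        optional double NestedField3;\n"
          + "      }\n"
          + "    }\n"
          + "  }\n"
          + "}");

        // Each optional or repeated level on the column path contributes one to
        // the maximum definition level: Field1, list, element, NestedField3 -> 4.
        // Only the repeated 'list' level contributes to the repetition level -> 1.
        int maxDef = schema.getMaxDefinitionLevel("Field1", "list", "element", "NestedField3");
        int maxRep = schema.getMaxRepetitionLevel("Field1", "list", "element", "NestedField3");
        System.out.println("max_definition_level = " + maxDef); // 4
        System.out.println("max_repetition_level = " + maxRep); // 1
    }
}
```

A writer that reports max_definition_level = 1 for this path is effectively counting only one non-required level instead of all the nested levels, which would explain why readers silently decode the column as null rather than failing.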