Vicky Papavasileiou created FLINK-35620:
-------------------------------------------

             Summary: Parquet writer creates wrong file for nested fields
                 Key: FLINK-35620
                 URL: https://issues.apache.org/jira/browse/FLINK-35620
             Project: Flink
          Issue Type: Bug
          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
    Affects Versions: 1.19.0
            Reporter: Vicky Papavasileiou


After PR [https://github.com/apache/flink/pull/24795] was merged, which added 
support for nested arrays, the Parquet writer produces broken Parquet files that 
cannot be read correctly. Note that the readers (both Flink and Iceberg) don't 
throw an exception; they silently return `null` for the nested field.

The error is in how `max_definition_level` is populated for nested fields. (A 
leaf column's maximum definition level should equal the number of optional and 
repeated levels on its path.)

Consider the following Avro schema:

```
{
  "namespace": "com.test",
  "type": "record",
  "name": "RecordData",
  "fields": [
    {
      "name": "Field1",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "NestedField2",
          "fields": [
            { "name": "NestedField3", "type": "double" }
          ]
        }
      }
    }
  ]
}
```
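
For reference, below is a sketch of the three-level LIST structure one would 
expect for this schema, built with the parquet-mr `Types` builder. It assumes 
every level (the outer record field, the list element, and the leaf) is written 
as an optional (nullable) field; under that assumption each optional level plus 
the repeated `list` group contributes one definition level, so the single leaf 
column ends up with max_definition_level 4 and max_repetition_level 1.

```
// Sketch of the expected three-level LIST layout, using parquet-mr
// (org.apache.parquet:parquet-column). Assumption: every nesting level is
// written as optional; the class name is a placeholder.
import org.apache.parquet.schema.LogicalTypeAnnotation;
import org.apache.parquet.schema.MessageType;
import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName;
import org.apache.parquet.schema.Types;

public class ExpectedLevels {
  public static void main(String[] args) {
    MessageType expected =
        Types.buildMessage()
            .optionalGroup().as(LogicalTypeAnnotation.listType())          // Field1
                .repeatedGroup()                                           // list
                    .optionalGroup()                                       // element
                        .optional(PrimitiveTypeName.DOUBLE).named("NestedField3")
                    .named("element")
                .named("list")
            .named("Field1")
            .named("RecordData");

    // Single leaf column Field1.list.element.NestedField3:
    // optional Field1 (+1), repeated list (+1), optional element (+1),
    // optional NestedField3 (+1) => max definition level 4;
    // the repeated "list" group alone => max repetition level 1.
    System.out.println(expected);
    System.out.println("max_definition_level = "
        + expected.getColumns().get(0).getMaxDefinitionLevel()); // 4
    System.out.println("max_repetition_level = "
        + expected.getColumns().get(0).getMaxRepetitionLevel()); // 1
  }
}
```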

Consider the excerpt below from the metadata of a Parquet file produced by Flink 
for the above schema:

```
############ Column(NestedField3) ############
name: NestedField3
path: Field1.list.element.NestedField3
max_definition_level: 1
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 7%)
```

The max_definition_level should be 4, but it is 1.
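
For anyone triaging, here is a minimal sketch (class name and file-path argument 
are placeholders) that prints the levels actually recorded in a produced file's 
footer, reading it with parquet-mr directly:

```
// Print max definition/repetition levels for every leaf column of a Parquet
// file, using parquet-mr (org.apache.parquet:parquet-hadoop).
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintLevels {
  public static void main(String[] args) throws Exception {
    // args[0] is the path of the Parquet file to inspect.
    try (ParquetFileReader reader = ParquetFileReader.open(
        HadoopInputFile.fromPath(new Path(args[0]), new Configuration()))) {
      MessageType schema = reader.getFooter().getFileMetaData().getSchema();
      for (ColumnDescriptor col : schema.getColumns()) {
        System.out.printf("%s  max_definition_level=%d  max_repetition_level=%d%n",
            String.join(".", col.getPath()),
            col.getMaxDefinitionLevel(),
            col.getMaxRepetitionLevel());
      }
    }
  }
}
```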



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
