Vicky Papavasileiou created FLINK-35620:
-------------------------------------------
Summary: Parquet writer creates wrong file for nested fields
Key: FLINK-35620
URL: https://issues.apache.org/jira/browse/FLINK-35620
Project: Flink
Issue Type: Bug
Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Affects Versions: 1.19.0
Reporter: Vicky Papavasileiou
After PR [https://github.com/apache/flink/pull/24795], which added support for nested arrays, was merged, the Parquet writer produces incorrect Parquet files that cannot be read correctly. Note that the readers (both Flink and Iceberg) don't throw an exception; they return `null` for the nested field instead.
The error is in how `max_definition_level` is populated for nested fields.
Consider Avro schema:
```
{
  "namespace": "com.test",
  "type": "record",
  "name": "RecordData",
  "fields": [
    {
      "name": "Field1",
      "type": {
        "type": "array",
        "items": {
          "type": "record",
          "name": "NestedField2",
          "fields": [
            { "name": "NestedField3", "type": "double" }
          ]
        }
      }
    }
  ]
}
```
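A minimal way to reproduce the write path (a sketch only, assuming the filesystem connector's `parquet` format is the affected writer; the table name, sink path, and inserted values are illustrative and not part of this report):
```
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.TableEnvironment;

public class NestedArrayParquetRepro {
    public static void main(String[] args) throws Exception {
        TableEnvironment tEnv = TableEnvironment.create(EnvironmentSettings.inBatchMode());

        // Filesystem sink using the built-in parquet format; the path is illustrative.
        tEnv.executeSql(
                "CREATE TABLE sink ("
                        + "  Field1 ARRAY<ROW<NestedField3 DOUBLE>>"
                        + ") WITH ("
                        + "  'connector' = 'filesystem',"
                        + "  'path' = 'file:///tmp/nested-parquet',"
                        + "  'format' = 'parquet'"
                        + ")");

        // Write a single row containing one nested element, then inspect the produced file.
        tEnv.executeSql("INSERT INTO sink VALUES (ARRAY[ROW(CAST(1.0 AS DOUBLE))])").await();
    }
}
```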
Consider the excerpt below from a Parquet file produced by Flink for the above
schema:
```
############ Column(SegmentStartTime) ############
name: NestedField3
path: Field1.list.element.NestedField3
max_definition_level: 1
max_repetition_level: 1
physical_type: DOUBLE
logical_type: None
converted_type (legacy): NONE
compression: SNAPPY (space_saved: 7%)
```
The max_definition_level should be 4 but is 1. (With the standard 3-level list encoding and Flink's nullable defaults, each of the four levels in the path Field1.list.element.NestedField3 is optional or repeated and contributes one definition level; only the repeated `list` level contributes to the repetition level, which is why max_repetition_level is 1.)
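The levels can be checked directly from the file footer with parquet-hadoop (a sketch; the file path is illustrative):
```
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.column.ColumnDescriptor;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.util.HadoopInputFile;
import org.apache.parquet.schema.MessageType;

public class PrintParquetLevels {
    public static void main(String[] args) throws Exception {
        Path file = new Path("file:///tmp/nested-parquet/part-0.parquet"); // illustrative path
        try (ParquetFileReader reader =
                ParquetFileReader.open(HadoopInputFile.fromPath(file, new Configuration()))) {
            MessageType schema = reader.getFooter().getFileMetaData().getSchema();
            for (ColumnDescriptor column : schema.getColumns()) {
                // For Field1.list.element.NestedField3 the writer should report
                // max_definition_level 4 and max_repetition_level 1.
                System.out.printf(
                        "%s def=%d rep=%d%n",
                        String.join(".", column.getPath()),
                        column.getMaxDefinitionLevel(),
                        column.getMaxRepetitionLevel());
            }
        }
    }
}
```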