Alex Sorokoumov created FLINK-35641:
---------------------------------------
Summary: ParquetSchemaConverter should correctly handle field optionality
Key: FLINK-35641
URL: https://issues.apache.org/jira/browse/FLINK-35641
Project: Flink
Issue Type: Bug
Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
Reporter: Alex Sorokoumov
At the moment,
[ParquetSchemaConverter|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/main/java/org/apache/flink/formats/parquet/utils/ParquetSchemaConverter.java#L64]
marks all fields as optional. This is incorrect in general and is especially
problematic for maps, whose keys the Parquet format requires to be marked
required. For example,
[parquet-tools|https://pypi.org/project/parquet-tools/] fails to open the Parquet
file produced by
[ParquetRowDataWriterTest#complexTypeTest|https://github.com/apache/flink/blob/99d6fd3c68f46daf0397a35566414e1d19774c3d/flink-formats/flink-parquet/src/test/java/org/apache/flink/formats/parquet/row/ParquetRowDataWriterTest.java#L140-L151]:
{noformat}
parquet-tools inspect /var/folders/sc/k3hr87fj4x169rdq9n107whw0000gp/T/junit14646865447948471989/3b328592-7315-48c6-8fa9-38da4048fb4e
Traceback (most recent call last):
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/bin/parquet-tools", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/cli.py", line 26, in main
    args.handler(args)
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 55, in _cli
    _execute_simple(
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/parquet_tools/commands/inspect.py", line 63, in _execute_simple
    pq_file: pq.ParquetFile = pq.ParquetFile(filename)
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/asorokoumov/.pyenv/versions/3.12.3/lib/python3.12/site-packages/pyarrow/parquet/core.py", line 317, in __init__
    self.reader.open(
  File "pyarrow/_parquet.pyx", line 1492, in pyarrow._parquet.ParquetReader.open
  File "pyarrow/error.pxi", line 91, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Map keys must be annotated as required.
{noformat}
The correct behavior is to mark nullable fields as optional and all other
fields as required.
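
The rule above could be sketched roughly as follows. This is only an illustration, not the actual ParquetSchemaConverter code: the class OptionalitySketch, the Repetition enum, and both helper methods are hypothetical stand-ins for the corresponding Parquet and Flink types. It also assumes that map keys should always be written as required, since that is what the reader in the traceback insists on.

```java
// Hypothetical, self-contained sketch; none of these names exist in Flink or Parquet.
final class OptionalitySketch {

    // Stand-in for Parquet's repetition levels (only the two relevant here).
    enum Repetition { REQUIRED, OPTIONAL }

    // Nullable fields become OPTIONAL, everything else REQUIRED.
    static Repetition repetitionFor(boolean nullable) {
        return nullable ? Repetition.OPTIONAL : Repetition.REQUIRED;
    }

    // Assumption: map keys are always REQUIRED, whatever the declared
    // nullability of the key type, because readers reject optional keys.
    static Repetition repetitionForMapKey(boolean declaredNullable) {
        return Repetition.REQUIRED;
    }

    public static void main(String[] args) {
        System.out.println("nullable field:     " + repetitionFor(true));
        System.out.println("non-nullable field: " + repetitionFor(false));
        System.out.println("map key:            " + repetitionForMapKey(true));
    }
}
```

In the real converter the nullability would presumably come from the Flink logical type of each field, and the result would be passed to whatever Parquet schema builder the converter already uses.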