TP Boudreau created ARROW-8127:
----------------------------------
Summary: [C++] [Parquet] Incorrect column chunk metadata for
multipage batch writes
Key: ARROW-8127
URL: https://issues.apache.org/jira/browse/ARROW-8127
Project: Apache Arrow
Issue Type: Bug
Components: C++
Reporter: TP Boudreau
Assignee: TP Boudreau
Attachments: multipage-batch-write.cc
When writing to a buffered column writer using PLAIN encoding, if the size of
the batch supplied for writing exceeds the page size for the writer, the
resulting file has an incorrect data_page_offset set in its column chunk
metadata. This causes an exception to be thrown when reading the file (the
file appears too short to the reader).
For example, the attached code, which attempts to write a batch of 262145
Int32s (1048580 bytes, i.e. 4 bytes over the default page size of 1048576
bytes) using the buffered writer with PLAIN encoding, fails on reading,
throwing the error: "Tried reading 1048678 bytes starting at position 1048633
from file but only got 333".
The error is caused by the second page write tripping the conditional at
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc#L302
in the serialized in-memory writer wrapped by the buffered writer.
The fix builds the metadata with offsets from the terminal sink rather than the
in-memory buffered sink. A PR is coming.
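To illustrate why those two offsets diverge (a purely illustrative toy, not the
library's code; the Sink type and payload strings are made up): when a buffered
row group stages its pages in an in-memory sink and flushes them to the file
later, a position recorded against the staging buffer is relative to that
buffer, not to the file.

#include <cstdint>
#include <iostream>
#include <string>

// Toy stand-in for an output sink with a Tell() position.
struct Sink {
  std::string bytes;
  int64_t Tell() const { return static_cast<int64_t>(bytes.size()); }
  void Write(const std::string& s) { bytes += s; }
};

int main() {
  Sink file;           // terminal sink: the actual .parquet file
  file.Write("PAR1");  // bytes already written before this row group

  Sink staging;        // in-memory sink used while the row group is buffered
  // Recording the data page offset against the staging buffer yields 0 ...
  int64_t buffered_offset = staging.Tell();
  // ... but the page's real position in the file is after the existing bytes.
  int64_t file_offset = file.Tell() + staging.Tell();
  staging.Write("DATA PAGE");

  file.Write(staging.bytes);  // flush the buffered row group into the file

  std::cout << "offset recorded from staging sink: " << buffered_offset << "\n"
            << "offset in the terminal sink:       " << file_offset << "\n";
  return 0;
}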