[PR] ParquetSink should be aware of arrow schema encoding (configurable) in the file metadata. [datafusion]

via GitHub Fri, 20 Dec 2024 17:07:35 -0800


wiedld opened a new pull request, #13866:
URL: https://github.com/apache/datafusion/pull/13866


   ## Which issue does this PR close?
   
   Closes https://github.com/apache/datafusion/issues/11770
   
   ## Rationale for this change
   
   The 
[ArrowWriter](https://docs.rs/parquet/53.3.0/parquet/arrow/arrow_writer/struct.ArrowWriter.html)
 with its default `ArrowWriterOptions` will encode the arrow schema in the 
parquet kv_metadata, unless explicitly skipped. Skipping is done via 
[ArrowWriterOptions::with_skip_arrow_metadata](https://docs.rs/parquet/53.3.0/parquet/arrow/arrow_writer/struct.ArrowWriterOptions.html#method.with_skip_arrow_metadata).
   
   In datafusion's ParquetSink, we can write in either single-threaded or 
parallelized mode. When in single-threaded mode, we use the default 
`ArrowWriterOptions` and the arrow schema is inserted into file kv_meta. 
However, when performing parallelized writes we do not use the ArrowWriter and 
instead rely upon the 
[SerializedFileWriter](https://docs.rs/parquet/53.3.0/parquet/file/writer/struct.SerializedFileWriter.html).
 As a result, we are missing the arrow schema metadata in the parquet files 
(see the issue ticket).
   
   ### ArrowWriterOptions vs WriterProperties
   
   The SerializedFileWriter, along with other associated writers, relies upon the 
[WriterProperties](https://docs.rs/parquet/53.3.0/parquet/file/properties/struct.WriterProperties.html).
 The `WriterProperties` differ from the `ArrowWriterOptions` only in terms of 
the `skip_arrow_metadata` (the missing configuration that causes our current 
issue) and the `schema_root`:
   
   ```
   pub struct ArrowWriterOptions {
       properties: WriterProperties,
       skip_arrow_metadata: bool,
       schema_root: Option<String>,
   }
   ```
   
   The `skip_arrow_metadata` is only used to modify the kv_metadata within the 
WriterProperties. Therefore I focused on updating our DF methods to construct 
WriterProperties to include this arrow schema (when configured). In this way, 
we can continue using WriterProperties -- with the added features we are 
missing from ArrowWriterOptions.
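
As a rough illustration of the idea above, the sketch below models how the arrow schema entry could be folded into the key/value metadata while building the writer properties. It is a simplified stand-in, not the code in this PR: `WriterPropsSketch` and `build_writer_props` are hypothetical names, and `encoded_schema` is assumed to be the already-encoded schema string. The only detail taken from the parquet crate is the metadata key, `ARROW:schema`.

```rust
// Simplified stand-ins for illustration only; these are NOT the parquet crate's types.

/// Key under which the parquet crate stores the encoded arrow schema.
const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

/// Hypothetical stand-in for the key/value metadata carried by WriterProperties.
#[derive(Debug, Clone, Default, PartialEq)]
struct WriterPropsSketch {
    kv_metadata: Vec<(String, String)>,
}

/// Hypothetical helper: build writer properties, adding the encoded arrow
/// schema to the kv metadata unless `skip_arrow_metadata` is set.
fn build_writer_props(
    user_kv: Vec<(String, String)>,
    encoded_schema: &str,
    skip_arrow_metadata: bool,
) -> WriterPropsSketch {
    let mut kv_metadata = user_kv;
    if !skip_arrow_metadata {
        // Drop any existing entry so the key appears exactly once.
        kv_metadata.retain(|(k, _)| k.as_str() != ARROW_SCHEMA_META_KEY);
        kv_metadata.push((
            ARROW_SCHEMA_META_KEY.to_string(),
            encoded_schema.to_string(),
        ));
    }
    WriterPropsSketch { kv_metadata }
}
```

Because the decision is made once, at construction time, any writer that only sees the resulting properties (such as the parallelized path) would pick up the same metadata as the single-threaded path.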
   
   
   ## What changes are included in this PR?
   
   * add a new configuration `ParquetOptions.skip_arrow_metadata`
   * have ParquetSink single-threaded writes, which use the ArrowWriter, 
respect this configuration
   * have ParquetSink multi-threaded writes, which use WriterProperties, 
respect this configuration
      * this is done by considering the inclusion of the arrow schema, or not, 
during the WriterProperties construction
   
   
   ## Are these changes tested?
   
   Yes.
   
   ## Are there any user-facing changes?
   
   We have new APIs:
   * `ParquetOptions.skip_arrow_metadata` configuration
   * deprecate `ParquetWriterOptions::try_from(ParquetWriterOptions)` 
and replace it with methods which explicitly handle the arrow schema
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

