wiedld opened a new pull request, #13866: URL: https://github.com/apache/datafusion/pull/13866
## Which issue does this PR close?

Closes https://github.com/apache/datafusion/issues/11770

## Rationale for this change

The [ArrowWriter](https://docs.rs/parquet/53.3.0/parquet/arrow/arrow_writer/struct.ArrowWriter.html) with its default `ArrowWriterOptions` will encode the arrow schema in the parquet kv_metadata, unless explicitly skipped. Skipping is done via [ArrowWriterOptions::with_skip_arrow_metadata](https://docs.rs/parquet/53.3.0/parquet/arrow/arrow_writer/struct.ArrowWriterOptions.html#method.with_skip_arrow_metadata).

In datafusion's ParquetSink, we can write in either single-threaded or parallelized mode. In single-threaded mode, we use the default `ArrowWriterOptions` and the arrow schema is inserted into the file kv_metadata. However, when performing parallelized writes we do not use the ArrowWriter and instead rely upon the [SerializedFileWriter](https://docs.rs/parquet/53.3.0/parquet/file/writer/struct.SerializedFileWriter.html). As a result, the arrow schema metadata is missing from the parquet files (see the issue ticket).

### ArrowWriterOptions vs WriterProperties

The SerializedFileWriter, along with other associated writers, relies upon the [WriterProperties](https://docs.rs/parquet/53.3.0/parquet/file/properties/struct.WriterProperties.html). `ArrowWriterOptions` differs from `WriterProperties` only by two additional fields: `skip_arrow_metadata` (the missing configuration that causes our current issue) and `schema_root`:

```
pub struct ArrowWriterOptions {
    properties: WriterProperties,
    skip_arrow_metadata: bool,
    schema_root: Option<String>,
}
```

The `skip_arrow_metadata` field is only used to modify the kv_metadata within the WriterProperties. Therefore I focused on updating our DF methods that construct WriterProperties so that they include the arrow schema (when configured). In this way, we can continue using WriterProperties, with the added features we are missing from ArrowWriterOptions.

## What changes are included in this PR?

* add a new configuration `ParquetOptions.skip_arrow_metadata`
* have ParquetSink single-threaded writes, which use the ArrowWriter, respect this configuration
* have ParquetSink multi-threaded writes, which use WriterProperties, respect this configuration
  * this is done by deciding whether or not to include the arrow schema during the WriterProperties construction

## Are these changes tested?

Yes.

## Are there any user-facing changes?

We have new APIs:

* `ParquetOptions.skip_arrow_metadata` configuration
* deprecate `ParquetWriterOptions::try_from(ParquetWriterOptions)` and replace it with methods which explicitly handle the arrow schema

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org