alamb commented on code in PR #13866:
URL: https://github.com/apache/datafusion/pull/13866#discussion_r1898124585
##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -51,38 +58,53 @@ impl ParquetWriterOptions {
}
}
-impl TryFrom<&TableParquetOptions> for ParquetWriterOptions {
- type Error = DataFusionError;
-
- fn try_from(parquet_table_options: &TableParquetOptions) -> Result<Self> {
+impl TableParquetOptions {
Review Comment:
It would be nice to avoid having to change this API -- I have a suggestion
below
##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -51,38 +58,53 @@ impl ParquetWriterOptions {
}
}
-impl TryFrom<&TableParquetOptions> for ParquetWriterOptions {
- type Error = DataFusionError;
-
- fn try_from(parquet_table_options: &TableParquetOptions) -> Result<Self> {
+impl TableParquetOptions {
+    #[deprecated(
+        since = "44.0.0",
+        note = "Please use `TableParquetOptions::into_writer_properties_builder` and `TableParquetOptions::into_writer_properties_builder_with_arrow_schema`"
+    )]
+    pub fn try_from(table_opts: &TableParquetOptions) -> Result<ParquetWriterOptions> {
         // ParquetWriterOptions will have defaults for the remaining fields (e.g. sorting_columns)
         Ok(ParquetWriterOptions {
-            writer_options: WriterPropertiesBuilder::try_from(parquet_table_options)?
-                .build(),
+            writer_options: table_opts.into_writer_properties_builder()?.build(),
         })
     }
-}
-
-impl TryFrom<&TableParquetOptions> for WriterPropertiesBuilder {
- type Error = DataFusionError;
     /// Convert the session's [`TableParquetOptions`] into a single write action's [`WriterPropertiesBuilder`].
     ///
     /// The returned [`WriterPropertiesBuilder`] includes customizations applicable per column.
-    fn try_from(table_parquet_options: &TableParquetOptions) -> Result<Self> {
+    pub fn into_writer_properties_builder(&self) -> Result<WriterPropertiesBuilder> {
+        self.into_writer_properties_builder_with_arrow_schema(None)
+    }
+
+    /// Convert the session's [`TableParquetOptions`] into a single write action's [`WriterPropertiesBuilder`].
+    ///
+    /// The returned [`WriterPropertiesBuilder`] includes customizations applicable per column,
+    /// as well as the arrow schema encoded into the kv_meta at [`ARROW_SCHEMA_META_KEY`].
+    pub fn into_writer_properties_builder_with_arrow_schema(
Review Comment:
I think making setting the arrow schema a different function might be more consistent with the rest of this API: https://docs.rs/datafusion/latest/datafusion/config/struct.ParquetOptions.html

So, for example, I would expect that `ParquetOptions::into_writer_properties_builder` would always set the arrow metadata, to be consistent with how `WriterProperties` works 🤔

If someone doesn't want the arrow metadata, I would expect an option to disable it, like https://docs.rs/parquet/53.3.0/parquet/arrow/arrow_writer/struct.ArrowWriterOptions.html#method.with_skip_arrow_metadata

So that might mean that `TableParquetOptions` has a field like `skip_arrow_metadata`, and then, depending on the value of that field, `into_writer_properties_builder` would set or not set the metadata appropriately.

That would also avoid having to change the conversion API.
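The flag-based shape suggested here could look roughly like the sketch below. Everything in it is a simplified stand-in written for illustration: `WriterPropertiesBuilder`, `TableParquetOptions`, and `ARROW_SCHEMA_META_KEY` are stubbed with plain std types rather than the real datafusion/parquet APIs, and `skip_arrow_metadata` is the hypothetical field name floated in the comment.

```rust
// Stub types standing in for the real parquet/datafusion ones, to show the
// suggested control flow only: one conversion function, gated by a flag.
#[derive(Default)]
struct WriterPropertiesBuilder {
    key_value_metadata: Vec<(String, String)>,
}

impl WriterPropertiesBuilder {
    fn set_key_value_metadata(mut self, key: String, value: String) -> Self {
        self.key_value_metadata.push((key, value));
        self
    }
}

const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

// Hypothetical field: `skip_arrow_metadata`, as suggested in the review.
struct TableParquetOptions {
    skip_arrow_metadata: bool,
}

impl TableParquetOptions {
    // One conversion function; the flag decides whether the encoded arrow
    // schema is attached to the key/value metadata.
    fn into_writer_properties_builder(
        &self,
        encoded_arrow_schema: &str, // stands in for the IPC-encoded schema
    ) -> WriterPropertiesBuilder {
        let builder = WriterPropertiesBuilder::default();
        if self.skip_arrow_metadata {
            builder
        } else {
            builder.set_key_value_metadata(
                ARROW_SCHEMA_META_KEY.to_string(),
                encoded_arrow_schema.to_string(),
            )
        }
    }
}

fn main() {
    let with_meta = TableParquetOptions { skip_arrow_metadata: false }
        .into_writer_properties_builder("<schema>");
    assert_eq!(with_meta.key_value_metadata.len(), 1);

    let without = TableParquetOptions { skip_arrow_metadata: true }
        .into_writer_properties_builder("<schema>");
    assert!(without.key_value_metadata.is_empty());
    println!("ok");
}
```

The point of the sketch is that callers keep using a single entry point, and opting out of the arrow metadata becomes configuration rather than a separate method.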
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -2519,6 +2612,42 @@ mod tests {
Ok(parquet_sink)
}
+    fn get_written(parquet_sink: Arc<ParquetSink>) -> Result<(Path, FileMetaData)> {
+        let mut written = parquet_sink.written();
+        let written = written.drain();
+        assert_eq!(
+            written.len(),
+            1,
+            "expected a single parquet file to be written, instead found {}",
+            written.len()
+        );
+
+        let (path, file_metadata) = written.take(1).next().unwrap();
+        Ok((path, file_metadata))
+    }
+
+    fn assert_file_metadata(file_metadata: FileMetaData, expected_kv: Vec<KeyValue>) {
Review Comment:
If you passed in a ref here it might make the code above simpler (less clone):
```suggestion
    fn assert_file_metadata(file_metadata: &FileMetaData, expected_kv: &[KeyValue]) {
```
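The borrow-based signature in the suggestion works because the assertion only needs to read the metadata. A self-contained sketch of the call-site difference, using stub `FileMetaData`/`KeyValue` types standing in for the real parquet ones:

```rust
// Stub types (not the real parquet structs) illustrating why taking
// references lets the caller keep ownership and skip `.clone()` calls.
#[derive(Debug, PartialEq, Clone)]
struct KeyValue {
    key: String,
    value: String,
}

struct FileMetaData {
    key_value_metadata: Vec<KeyValue>,
}

// By-reference: the caller only lends the metadata out for the check.
fn assert_file_metadata(file_metadata: &FileMetaData, expected_kv: &[KeyValue]) {
    assert_eq!(file_metadata.key_value_metadata, expected_kv);
}

fn main() {
    let kv = vec![KeyValue {
        key: "ARROW:schema".into(),
        value: "<encoded>".into(),
    }];
    let meta = FileMetaData {
        key_value_metadata: kv.clone(),
    };
    assert_file_metadata(&meta, &kv);
    // `meta` and `kv` are still usable here; a by-value signature would have
    // forced a clone (or given up ownership) at the call above.
    assert_eq!(meta.key_value_metadata.len(), 1);
    println!("ok");
}
```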
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -759,10 +778,14 @@ impl ParquetSink {
parquet_props: WriterProperties,
) -> Result<AsyncArrowWriter<BufWriter>> {
let buf_writer = BufWriter::new(object_store, location.clone());
- let writer = AsyncArrowWriter::try_new(
+ let options = ArrowWriterOptions::new()
Review Comment:
this is nice
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -750,6 +749,26 @@ impl ParquetSink {
}
}
+ /// Create writer properties based upon configuration settings,
+ /// including partitioning and the inclusion of arrow schema metadata.
+ fn create_writer_props(&self) -> Result<WriterProperties> {
+ let props = if !self.parquet_options.global.skip_arrow_metadata {
+            let schema = if self.parquet_options.global.allow_single_file_parallelism {
Review Comment:
Can you explain why different schemas are needed for single file parallelism?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]