alamb commented on code in PR #13866:
URL: https://github.com/apache/datafusion/pull/13866#discussion_r1898124585
##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -51,38 +58,53 @@ impl ParquetWriterOptions {
}
}
-impl TryFrom<&TableParquetOptions> for ParquetWriterOptions {
- type Error = DataFusionError;
-
- fn try_from(parquet_table_options: &TableParquetOptions) -> Result<Self> {
+impl TableParquetOptions {
Review Comment:
It would be nice to avoid having to change this API -- I have a suggestion
below
##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -51,38 +58,53 @@ impl ParquetWriterOptions {
}
}
-impl TryFrom<&TableParquetOptions> for ParquetWriterOptions {
- type Error = DataFusionError;
-
- fn try_from(parquet_table_options: &TableParquetOptions) -> Result<Self> {
+impl TableParquetOptions {
+    #[deprecated(
+        since = "44.0.0",
+        note = "Please use `TableParquetOptions::into_writer_properties_builder` and `TableParquetOptions::into_writer_properties_builder_with_arrow_schema`"
+    )]
+    pub fn try_from(table_opts: &TableParquetOptions) -> Result<ParquetWriterOptions> {
         // ParquetWriterOptions will have defaults for the remaining fields (e.g. sorting_columns)
         Ok(ParquetWriterOptions {
-            writer_options: WriterPropertiesBuilder::try_from(parquet_table_options)?
-                .build(),
+            writer_options: table_opts.into_writer_properties_builder()?.build(),
         })
     }
-}
-
-impl TryFrom<&TableParquetOptions> for WriterPropertiesBuilder {
- type Error = DataFusionError;
     /// Convert the session's [`TableParquetOptions`] into a single write action's [`WriterPropertiesBuilder`].
     ///
     /// The returned [`WriterPropertiesBuilder`] includes customizations applicable per column.
-    fn try_from(table_parquet_options: &TableParquetOptions) -> Result<Self> {
+    pub fn into_writer_properties_builder(&self) -> Result<WriterPropertiesBuilder> {
+        self.into_writer_properties_builder_with_arrow_schema(None)
+    }
+
+    /// Convert the session's [`TableParquetOptions`] into a single write action's [`WriterPropertiesBuilder`].
+    ///
+    /// The returned [`WriterPropertiesBuilder`] includes customizations applicable per column,
+    /// as well as the arrow schema encoded into the kv_meta at [`ARROW_SCHEMA_META_KEY`].
+    pub fn into_writer_properties_builder_with_arrow_schema(
Review Comment:
I think making setting the arrow schema a different function might be more consistent with the rest of this API: https://docs.rs/datafusion/latest/datafusion/config/struct.ParquetOptions.html

So, for example, I would expect that `ParquetOptions::into_writer_properties_builder` would always set the arrow metadata, to be consistent with how `WriterProperties` works 🤔

If someone doesn't want the arrow metadata, I would expect an option to disable it, like https://docs.rs/parquet/53.3.0/parquet/arrow/arrow_writer/struct.ArrowWriterOptions.html#method.with_skip_arrow_metadata

So that might mean that `TableParquetOptions` has a field like `skip_arrow_metadata`, and then, depending on the value of that field, `into_writer_properties_builder` would set or not set the metadata appropriately.

That would also avoid having to change the conversion API.
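The flag-based shape suggested here could look roughly like the sketch below. Everything in it is a simplified stand-in written for illustration: `WriterPropertiesBuilder`, `TableParquetOptions`, and `ARROW_SCHEMA_META_KEY` are stubbed with plain std types rather than the real datafusion/parquet APIs, and `skip_arrow_metadata` is the hypothetical field name floated in the comment.

```rust
// Stub types standing in for the real parquet/datafusion ones, to show the
// suggested control flow only: one conversion function, gated by a flag.
#[derive(Default)]
struct WriterPropertiesBuilder {
    key_value_metadata: Vec<(String, String)>,
}

impl WriterPropertiesBuilder {
    fn set_key_value_metadata(mut self, key: String, value: String) -> Self {
        self.key_value_metadata.push((key, value));
        self
    }
}

const ARROW_SCHEMA_META_KEY: &str = "ARROW:schema";

// Hypothetical field: `skip_arrow_metadata`, as suggested in the review.
struct TableParquetOptions {
    skip_arrow_metadata: bool,
}

impl TableParquetOptions {
    // One conversion function; the flag decides whether the encoded arrow
    // schema is attached to the key/value metadata.
    fn into_writer_properties_builder(
        &self,
        encoded_arrow_schema: &str, // stands in for the IPC-encoded schema
    ) -> WriterPropertiesBuilder {
        let builder = WriterPropertiesBuilder::default();
        if self.skip_arrow_metadata {
            builder
        } else {
            builder.set_key_value_metadata(
                ARROW_SCHEMA_META_KEY.to_string(),
                encoded_arrow_schema.to_string(),
            )
        }
    }
}

fn main() {
    let with_meta = TableParquetOptions { skip_arrow_metadata: false }
        .into_writer_properties_builder("<schema>");
    assert_eq!(with_meta.key_value_metadata.len(), 1);

    let without = TableParquetOptions { skip_arrow_metadata: true }
        .into_writer_properties_builder("<schema>");
    assert!(without.key_value_metadata.is_empty());
    println!("ok");
}
```

The point of the sketch is that callers keep using a single entry point, and opting out of the arrow metadata becomes configuration rather than a separate method.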
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -2519,6 +2612,42 @@ mod tests {
Ok(parquet_sink)
}
+    fn get_written(parquet_sink: Arc<ParquetSink>) -> Result<(Path, FileMetaData)> {
+        let mut written = parquet_sink.written();
+        let written = written.drain();
+        assert_eq!(
+            written.len(),
+            1,
+            "expected a single parquet file to be written, instead found {}",
+            written.len()
+        );
+
+        let (path, file_metadata) = written.take(1).next().unwrap();
+        Ok((path, file_metadata))
+    }
+
+    fn assert_file_metadata(file_metadata: FileMetaData, expected_kv: Vec<KeyValue>) {
Review Comment:
If you passed in a ref here it might make the code above simpler (less clone):
```suggestion
    fn assert_file_metadata(file_metadata: &FileMetaData, expected_kv: &[KeyValue]) {
```
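The borrow-based signature in the suggestion works because the assertion only needs to read the metadata. A self-contained sketch of the call-site difference, using stub `FileMetaData`/`KeyValue` types standing in for the real parquet ones:

```rust
// Stub types (not the real parquet structs) illustrating why taking
// references lets the caller keep ownership and skip `.clone()` calls.
#[derive(Debug, PartialEq, Clone)]
struct KeyValue {
    key: String,
    value: String,
}

struct FileMetaData {
    key_value_metadata: Vec<KeyValue>,
}

// By-reference: the caller only lends the metadata out for the check.
fn assert_file_metadata(file_metadata: &FileMetaData, expected_kv: &[KeyValue]) {
    assert_eq!(file_metadata.key_value_metadata, expected_kv);
}

fn main() {
    let kv = vec![KeyValue {
        key: "ARROW:schema".into(),
        value: "<encoded>".into(),
    }];
    let meta = FileMetaData {
        key_value_metadata: kv.clone(),
    };
    assert_file_metadata(&meta, &kv);
    // `meta` and `kv` are still usable here; a by-value signature would have
    // forced a clone (or given up ownership) at the call above.
    assert_eq!(meta.key_value_metadata.len(), 1);
    println!("ok");
}
```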
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -759,10 +778,14 @@ impl ParquetSink {
parquet_props: WriterProperties,
) -> Result<AsyncArrowWriter<BufWriter>> {
let buf_writer = BufWriter::new(object_store, location.clone());
- let writer = AsyncArrowWriter::try_new(
+ let options = ArrowWriterOptions::new()
Review Comment:
this is nice
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -750,6 +749,26 @@ impl ParquetSink {
}
}
+ /// Create writer properties based upon configuration settings,
+ /// including partitioning and the inclusion of arrow schema metadata.
+ fn create_writer_props(&self) -> Result<WriterProperties> {
+ let props = if !self.parquet_options.global.skip_arrow_metadata {
+            let schema = if self.parquet_options.global.allow_single_file_parallelism {
Review Comment:
Can you explain why different schemas are needed for single file parallelism?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
For additional commands, e-mail: [email protected]