alamb commented on code in PR #10224:
URL: https://github.com/apache/datafusion/pull/10224#discussion_r1579412563


##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -1907,6 +1916,13 @@ mod tests {
             "output file metadata should contain col b"
         );
 
+        let key_value_metadata = key_value_metadata.unwrap();
+        let my_metadata = key_value_metadata
+            .iter()
+            .filter(|kv| kv.key == "my-data")
+            .collect::<Vec<_>>();
+        assert_eq!(my_metadata.len(), 1);
+

Review Comment:
   I think it would be good here to verify the actual metadata value too (not 
just that it is present). Something like this perhaps:
   
   ```suggestion
           let key_value_metadata = key_value_metadata.unwrap();
           let expected_metadata = vec![KeyValue {
               key: "my-data".to_string(),
               value: Some("stuff".to_string()),
           }];
           assert_eq!(key_value_metadata, expected_metadata);
   ```



##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -47,53 +53,87 @@ impl TryFrom<&TableParquetOptions> for ParquetWriterOptions 
{
     type Error = DataFusionError;
 
     fn try_from(parquet_options: &TableParquetOptions) -> Result<Self> {
-        let parquet_session_options = &parquet_options.global;
-        let mut builder = WriterProperties::builder()
-            
.set_data_page_size_limit(parquet_session_options.data_pagesize_limit)
-            .set_write_batch_size(parquet_session_options.write_batch_size)
-            .set_writer_version(parse_version_string(
-                &parquet_session_options.writer_version,
-            )?)
-            .set_dictionary_page_size_limit(
-                parquet_session_options.dictionary_page_size_limit,
-            )
-            .set_max_row_group_size(parquet_session_options.max_row_group_size)
-            .set_created_by(parquet_session_options.created_by.clone())
-            .set_column_index_truncate_length(
-                parquet_session_options.column_index_truncate_length,
+        let ParquetOptions {

Review Comment:
   I like this change as it makes it harder to forget to pass through options 
if we add new options to `TableParquetOptions`



##########
datafusion/sqllogictest/test_files/copy.slt:
##########
@@ -283,11 +283,38 @@ OPTIONS (
 'format.statistics_enabled::col2' none,
 'format.max_statistics_size' 123,
 'format.bloom_filter_fpp' 0.001,
-'format.bloom_filter_ndv' 100
+'format.bloom_filter_ndv' 100,
+'format.metadata' 'foo:bar baz'
 )
 ----
 2
 
+# valid vs invalid metadata
+
+statement ok
+COPY source_table
+TO 'test_files/scratch/copy/table_with_metadata/'
+STORED AS PARQUET
+OPTIONS (
+'format.metadata' ''
+)
+
+statement error
+COPY source_table
+TO 'test_files/scratch/copy/table_with_metadata/'
+STORED AS PARQUET
+OPTIONS (
+'format.metadata' 'foo:bar:extra'
+)
+
+statement error

Review Comment:
   ```suggestion
   statement error DataFusion error: Invalid or Unsupported Configuration: 
Config value "wrong\-metadata\-key" not found on ParquetOptions
   ```



##########
datafusion/common/src/config.rs:
##########
@@ -1370,6 +1370,9 @@ pub struct TableParquetOptions {
     pub global: ParquetOptions,
     /// Column specific options. Default usage is parquet.XX::column.
     pub column_specific_options: HashMap<String, ColumnOptions>,
+    /// Additional metadata to be inserted into the key_value_metadata

Review Comment:
   ```suggestion
       /// Additional file-level metadata to include. Inserted into the 
key_value_metadata
   ```



##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -141,6 +181,8 @@ impl TryFrom<&TableParquetOptions> for ParquetWriterOptions 
{
                     builder.set_column_max_statistics_size(path, 
max_statistics_size);
             }
         }
+
+        // ParquetWriterOptions will have defaults for the remaining fields 
(e.g. key_value_metadata & sorting_columns)

Review Comment:
   This might be useful to include as a comment higher up in the doc comments 
as well, perhaps on `TableParquetOptions`



##########
datafusion/sqllogictest/test_files/copy.slt:
##########
@@ -283,11 +283,38 @@ OPTIONS (
 'format.statistics_enabled::col2' none,
 'format.max_statistics_size' 123,
 'format.bloom_filter_fpp' 0.001,
-'format.bloom_filter_ndv' 100
+'format.bloom_filter_ndv' 100,
+'format.metadata' 'foo:bar baz'
 )
 ----
 2
 
+# valid vs invalid metadata
+
+statement ok
+COPY source_table
+TO 'test_files/scratch/copy/table_with_metadata/'
+STORED AS PARQUET
+OPTIONS (
+'format.metadata' ''
+)
+
+statement error

Review Comment:
   I think it is valuable to include the error message as well (to ensure the 
test doesn't have a typo or something and is thus not actually covering the 
error we exepect):
   
   ```suggestion
   statement error DataFusion error: Invalid or Unsupported Configuration: 
Invalid metadata provided "foo:bar:extra"
   ```
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to