alamb commented on code in PR #10224:
URL: https://github.com/apache/datafusion/pull/10224#discussion_r1579412563
##########
datafusion/core/src/datasource/file_format/parquet.rs:
##########
@@ -1907,6 +1916,13 @@ mod tests {
"output file metadata should contain col b"
);
+ let key_value_metadata = key_value_metadata.unwrap();
+ let my_metadata = key_value_metadata
+ .iter()
+ .filter(|kv| kv.key == "my-data")
+ .collect::<Vec<_>>();
+ assert_eq!(my_metadata.len(), 1);
+
Review Comment:
I think it would be good here to verify the actual metadata value too (not
just that it is present). Something like this perhaps:
```suggestion
let key_value_metadata = key_value_metadata.unwrap();
let expected_metadata = vec![KeyValue {
key: "my-data".to_string(),
value: Some("stuff".to_string()),
}];
assert_eq!(key_value_metadata, expected_metadata);
```
##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -47,53 +53,87 @@ impl TryFrom<&TableParquetOptions> for ParquetWriterOptions
{
type Error = DataFusionError;
fn try_from(parquet_options: &TableParquetOptions) -> Result<Self> {
- let parquet_session_options = &parquet_options.global;
- let mut builder = WriterProperties::builder()
-
.set_data_page_size_limit(parquet_session_options.data_pagesize_limit)
- .set_write_batch_size(parquet_session_options.write_batch_size)
- .set_writer_version(parse_version_string(
- &parquet_session_options.writer_version,
- )?)
- .set_dictionary_page_size_limit(
- parquet_session_options.dictionary_page_size_limit,
- )
- .set_max_row_group_size(parquet_session_options.max_row_group_size)
- .set_created_by(parquet_session_options.created_by.clone())
- .set_column_index_truncate_length(
- parquet_session_options.column_index_truncate_length,
+ let ParquetOptions {
Review Comment:
I like this change as it makes it harder to forget to pass through options
if we add new options to `TableParquetOptions`
##########
datafusion/sqllogictest/test_files/copy.slt:
##########
@@ -283,11 +283,38 @@ OPTIONS (
'format.statistics_enabled::col2' none,
'format.max_statistics_size' 123,
'format.bloom_filter_fpp' 0.001,
-'format.bloom_filter_ndv' 100
+'format.bloom_filter_ndv' 100,
+'format.metadata' 'foo:bar baz'
)
----
2
+# valid vs invalid metadata
+
+statement ok
+COPY source_table
+TO 'test_files/scratch/copy/table_with_metadata/'
+STORED AS PARQUET
+OPTIONS (
+'format.metadata' ''
+)
+
+statement error
+COPY source_table
+TO 'test_files/scratch/copy/table_with_metadata/'
+STORED AS PARQUET
+OPTIONS (
+'format.metadata' 'foo:bar:extra'
+)
+
+statement error
Review Comment:
```suggestion
statement error DataFusion error: Invalid or Unsupported Configuration:
Config value "wrong\-metadata\-key" not found on ParquetOptions
```
##########
datafusion/common/src/config.rs:
##########
@@ -1370,6 +1370,9 @@ pub struct TableParquetOptions {
pub global: ParquetOptions,
/// Column specific options. Default usage is parquet.XX::column.
pub column_specific_options: HashMap<String, ColumnOptions>,
+ /// Additional metadata to be inserted into the key_value_metadata
Review Comment:
```suggestion
/// Additional file-level metadata to include. Inserted into the
key_value_metadata
```
##########
datafusion/common/src/file_options/parquet_writer.rs:
##########
@@ -141,6 +181,8 @@ impl TryFrom<&TableParquetOptions> for ParquetWriterOptions
{
builder.set_column_max_statistics_size(path,
max_statistics_size);
}
}
+
+ // ParquetWriterOptions will have defaults for the remaining fields
(e.g. key_value_metadata & sorting_columns)
Review Comment:
This might be useful to include as a comment higher up in the doc comments
as well, perhaps on `TableParquetOptions`
##########
datafusion/sqllogictest/test_files/copy.slt:
##########
@@ -283,11 +283,38 @@ OPTIONS (
'format.statistics_enabled::col2' none,
'format.max_statistics_size' 123,
'format.bloom_filter_fpp' 0.001,
-'format.bloom_filter_ndv' 100
+'format.bloom_filter_ndv' 100,
+'format.metadata' 'foo:bar baz'
)
----
2
+# valid vs invalid metadata
+
+statement ok
+COPY source_table
+TO 'test_files/scratch/copy/table_with_metadata/'
+STORED AS PARQUET
+OPTIONS (
+'format.metadata' ''
+)
+
+statement error
Review Comment:
I think it is valuable to include the error message as well (to ensure the
test doesn't have a typo or something and is thus not actually covering the
error we exepect):
```suggestion
statement error DataFusion error: Invalid or Unsupported Configuration:
Invalid metadata provided "foo:bar:extra"
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]