wiedld opened a new issue, #10223:
URL: https://github.com/apache/datafusion/issues/10223

   ### Describe the bug
   
   IOx adds its own metadata to the parquet files it writes. Currently, we do so using the 
[WriterProperties with the 
ArrowWriter](https://github.com/apache/arrow-rs/blob/11450ae8ddf902b57cb42491a3d824d9550a05ea/parquet/src/arrow/arrow_writer/mod.rs#L147).
 We want to start performing parquet writes with DataFusion's ParquetSink; 
however, a recent change removed the ability to add our own metadata.
   
   There was a change to unify the different writer options across sink types, 
specifically to make `COPY TO` and `create external table` have a uniform 
configuration. Users can [now specify the configuration with the 
query](https://github.com/apache/datafusion/pull/9382) (e.g. `COPY <src> TO 
<sink> (<config_options>)`). This was a good high-level change; however, as a side 
effect, it removed the ability to add our own metadata.
   
   The current implementation (after the above change) now [derives the writer 
properties from the 
TableParquetOptions](https://github.com/apache/datafusion/blob/06895157e7f985fc4d9b0b6298c07d92abb4cc07/datafusion/core/src/datasource/file_format/parquet.rs#L648).
 This conversion always sets `sorting_columns` and the user-defined `kv_metadata` 
to `None`, as demonstrated in [the first commit of the 
fix](https://github.com/influxdata/arrow-datafusion/pull/11/commits/391e07466b5c5a381e578e50a1abc89f509e0a69).
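
To make the failure mode concrete, here is a minimal sketch of the conversion using simplified stand-in types. `TableOptionsSketch`, `WriterPropsSketch`, and both functions are hypothetical names invented for illustration; they are not the DataFusion or parquet types, and the real structs carry many more fields.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified stand-in for the table-level parquet options.
#[derive(Debug, Default, Clone)]
struct TableOptionsSketch {
    compression: Option<String>,
    /// User-defined key/value metadata destined for the file footer.
    key_value_metadata: HashMap<String, Option<String>>,
}

/// Hypothetical, simplified stand-in for the writer properties.
#[derive(Debug, PartialEq)]
struct WriterPropsSketch {
    compression: Option<String>,
    key_value_metadata: Option<Vec<(String, Option<String>)>>,
}

/// Models the reported behavior: the metadata is hardcoded to `None`,
/// so whatever the caller configured is silently dropped.
fn convert_dropping_metadata(opts: &TableOptionsSketch) -> WriterPropsSketch {
    WriterPropsSketch {
        compression: opts.compression.clone(),
        key_value_metadata: None,
    }
}

/// Models the requested behavior: the metadata is carried through.
fn convert_keeping_metadata(opts: &TableOptionsSketch) -> WriterPropsSketch {
    let mut pairs: Vec<(String, Option<String>)> = opts
        .key_value_metadata
        .iter()
        .map(|(k, v)| (k.clone(), v.clone()))
        .collect();
    // HashMap iteration order is unspecified; sort for a stable result.
    pairs.sort();
    WriterPropsSketch {
        compression: opts.compression.clone(),
        key_value_metadata: if pairs.is_empty() { None } else { Some(pairs) },
    }
}
```

With one metadata entry configured, the first function returns `key_value_metadata: None` while the second returns the entry, which is the difference this issue asks to close.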
   
   
   ### To Reproduce
   
   The hardcoded setting of the user metadata to None is demonstrated in [this 
commit](https://github.com/influxdata/arrow-datafusion/pull/11/commits/391e07466b5c5a381e578e50a1abc89f509e0a69).
   
   ### Expected behavior
   
   The expected behavior is to be able to set our own metadata. Ideally, 
user-inserted metadata would also be available as an option in the SQL-level API: 
   ```
   COPY source_table TO 'sink' STORED AS PARQUET OPTIONS ('format.metadata' 'key:value')
   ```
   
   The expected outcome is demonstrated in [this 
commit](https://github.com/influxdata/arrow-datafusion/pull/11/commits/8beb16ad037cdd7f33da5b4f8277bb3bc2baa305).
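
As a rough illustration of the proposed `'key:value'` option format, the sketch below splits such a string into a key and an optional value. The function name `parse_metadata_option` and its error handling are assumptions made for this example; the linked commit may parse the option differently.

```rust
/// Hypothetical helper: split a `key:value` option string into a metadata pair.
/// Only the first `:` separates key from value, so values may contain colons.
fn parse_metadata_option(raw: &str) -> Result<(String, Option<String>), String> {
    let (key, value) = match raw.split_once(':') {
        Some((k, v)) => (k.trim(), Some(v.trim())),
        // No separator: treat the whole string as a key without a value.
        None => (raw.trim(), None),
    };
    if key.is_empty() {
        return Err(format!("metadata option '{}' has an empty key", raw));
    }
    // An empty value after the colon is treated the same as no value.
    let value = value.filter(|v| !v.is_empty()).map(str::to_string);
    Ok((key.to_string(), value))
}
```

Treating a missing or empty value as `None` is a design choice in this sketch, made on the assumption that parquet key/value metadata permits a key without a value.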
 
   
   ### Additional context
   
   _No response_


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

