The process would be to create a PR proposing an update to the custom metadata specification [1] to reserve a new key and describe its use. Then send a [DISCUSS] email on this list. Once there is consensus, we can formally vote and merge the change.
[1] https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst

On Wed, Feb 24, 2021 at 3:47 PM Mike Seddon <seddo...@gmail.com> wrote:
> Thanks for both of your comments.
>
> @Andrew Schema.metadata does look like a logical place to house the
> information, so that would solve part of the problem. Do you have any
> thoughts on whether we change the function signature:
>
> From: Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>
> To: Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>
>
> @Micah It would be nice to have a reserved metadata key so this could be
> shared, but I am not sure of the admin process for the Arrow project to
> agree on something like that. Is there a forum?
>
> On Thu, Feb 25, 2021 at 8:58 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
> > In at least C++ (and the IPC format), a schema can be shared across many
> > RecordBatches, which might have different sources.
> >
> > It might be useful to define a reserved metadata key (similar to
> > extension types) so that the data can be interpreted consistently.
> >
> > On Wed, Feb 24, 2021 at 11:29 AM Andrew Lamb <al...@influxdata.com> wrote:
> > > I wonder if you could add the file_name as metadata on the `Schema` of
> > > the RecordBatch rather than the RecordBatch itself? Since every
> > > RecordBatch has a schema, I don't fully understand the need to add
> > > something additional to the RecordBatch.
> > >
> > > https://docs.rs/arrow/3.0.0/arrow/datatypes/struct.Schema.html#method.new_with_metadata
> > >
> > > On Wed, Feb 24, 2021 at 1:20 AM Mike Seddon <seddo...@gmail.com> wrote:
> > > > Hi,
> > > >
> > > > One of Apache Spark's very useful SQL functions is 'input_file_name',
> > > > which provides a simple API for identifying the source of a row of
> > > > data when it comes from a file-based source like Parquet or CSV.
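A minimal sketch of the Schema-metadata approach Andrew suggests. The real arrow-rs `Schema::new_with_metadata` also takes a `Vec<Field>`; the `Schema` stand-in below is simplified so the example is self-contained, and the `"source_file"` key name is purely illustrative, not a reserved key:

```rust
use std::collections::HashMap;

// Simplified stand-in for arrow::datatypes::Schema: the real type also
// carries a field list, but only the metadata map matters for this sketch.
#[derive(Debug)]
pub struct Schema {
    pub metadata: HashMap<String, String>,
}

impl Schema {
    // Mirrors arrow-rs Schema::new_with_metadata, minus the Vec<Field>.
    pub fn new_with_metadata(metadata: HashMap<String, String>) -> Self {
        Schema { metadata }
    }
}

// A file-based reader would attach the path when it builds the schema.
// "source_file" is an illustrative key name, not a reserved one.
pub fn schema_for_file(path: &str) -> Schema {
    let mut metadata = HashMap::new();
    metadata.insert("source_file".to_string(), path.to_string());
    Schema::new_with_metadata(metadata)
}

// An input_file_name-style function then reads the key back out.
pub fn input_file_name(schema: &Schema) -> Option<&str> {
    schema.metadata.get("source_file").map(String::as_str)
}

fn main() {
    let schema = schema_for_file("part-00000.parquet");
    assert_eq!(input_file_name(&schema), Some("part-00000.parquet"));
}
```

This avoids touching `RecordBatch` at all, at the cost Micah raises: one schema may be shared across batches from different files.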
> > > This > > > > is particularly useful for identifying which chunk/partition of a > > Parquet > > > > the row came from and is used heavily by the DeltaLake format to > > > determine > > > > which files are impacted for MERGE operations. > > > > > > > > I have built a functional proof-of-concept for DataFusion but it > > requires > > > > modifying the RecordBatch struct to include a 'metadata' struct > > > > (RecordBatchMetadata) to carry the source file name attached to each > > > batch. > > > > > > > > It also requires changing the ScalarFunctionImplementation signature > to > > > > support exposing the metadata (and therefore all the functions). > > > > > > > > From: <Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + > > > > Sync>; > > > > To: <Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> > > > > Result<ColumnarValue> + Send + Sync>; > > > > > > > > These changes have been made in a personal feature branch and are > > > available > > > > for review (still needs cleaning) but conceptually does anyone have a > > > > problem with this API change or have a better proposal? > > > > > > > > Thanks > > > > Mike > > > > > > > > > >