Hi Micah,
Thank you for providing this information. I have reviewed the documentation
you provided and have a few conclusions:

1. RecordBatch cannot carry user-defined metadata (KeyValue attributes):
https://github.com/apache/arrow/blob/master/format/Message.fbs#L83
2. Schema does have this capability, but it would not work for passing
per-batch input files: by design the Schema object is passed once, and the
series of interleaved DictionaryBatch and RecordBatch messages that follows
must conform to that Schema:
https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#ipc-streaming-format
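
To make point 2 concrete, here is a rough sketch of the ordering constraint
as I read it from the streaming format description. The IpcMessage enum and
check_stream function are hypothetical stand-ins written for this email, not
types from the arrow crate:

```rust
// Hypothetical model of the message kinds in an IPC stream.
// These are NOT the arrow crate's types; they only illustrate ordering.
#[derive(Debug, Clone, PartialEq)]
enum IpcMessage {
    Schema { metadata: Vec<(String, String)> },
    DictionaryBatch,
    RecordBatch { num_rows: usize },
}

/// Checks that the stream starts with a Schema and never repeats it.
/// On success, returns the stream-wide schema metadata and the total row count.
fn check_stream(messages: &[IpcMessage]) -> Result<(&[(String, String)], usize), String> {
    // The first message must be the Schema; its metadata applies to every batch.
    let metadata = match messages.first() {
        Some(IpcMessage::Schema { metadata }) => metadata.as_slice(),
        _ => return Err("stream must start with a Schema message".to_string()),
    };

    let mut total_rows = 0;
    for msg in &messages[1..] {
        match msg {
            // A second Schema (e.g. to change the file name) is not allowed here.
            IpcMessage::Schema { .. } => {
                return Err("Schema may only appear once per stream".to_string())
            }
            IpcMessage::DictionaryBatch => {}
            IpcMessage::RecordBatch { num_rows } => total_rows += *num_rows,
        }
    }
    Ok((metadata, total_rows))
}
```

Under this model there is a single metadata map for the whole stream, so a
different file name per batch has nowhere to go.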

In the Rust implementation each RecordBatch embeds the Schema, so each
schema can have different metadata (like the filename in this case). I think
this will have to be implemented in Rust as a memory-only attribute, which
does not get persisted unless more significant changes to the protocol are
made.
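
A minimal sketch of the memory-only idea, assuming simplified stand-in types.
Schema and RecordBatch below are hypothetical and much smaller than the real
arrow crate structs, and the "input_file_name" key and both helper functions
are names I made up for illustration; nothing here is reserved by the Arrow
specification:

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical, simplified stand-ins for the arrow crate types.
// The real structs also carry fields and column data.
#[derive(Debug)]
struct Schema {
    metadata: HashMap<String, String>,
}

#[derive(Debug)]
struct RecordBatch {
    schema: Arc<Schema>,
    num_rows: usize,
}

// Assumed key name; not a reserved key in the Arrow specification.
const INPUT_FILE_KEY: &str = "input_file_name";

/// Builds a batch whose own schema records the file it was read from.
fn batch_for_file(file_name: &str, num_rows: usize) -> RecordBatch {
    let mut metadata = HashMap::new();
    metadata.insert(INPUT_FILE_KEY.to_string(), file_name.to_string());
    RecordBatch {
        schema: Arc::new(Schema { metadata }),
        num_rows,
    }
}

/// Returns one value per row: the source file name, or None if it is unknown.
fn input_file_name(batch: &RecordBatch) -> Vec<Option<String>> {
    let name = batch.schema.metadata.get(INPUT_FILE_KEY).cloned();
    vec![name; batch.num_rows]
}
```

Because each batch holds its own Arc<Schema>, two batches read from different
files can report different names, but the value only lives in memory and is
lost as soon as the batches are written to a stream with a shared Schema.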

Thanks
Mike

On Thu, Feb 25, 2021 at 11:14 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> The process would be to create a PR proposal to update the custom
> metadata specification [1] to reserve a new word and describe its use.
> Then send a [DISCUSS] email on this list.  Once there is consensus we can
> formally vote and merge the change.
>
> [1]
> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst
>
> On Wed, Feb 24, 2021 at 3:47 PM Mike Seddon <seddo...@gmail.com> wrote:
>
> > Thanks for both of your comments.
> >
> > @Andrew Schema.metadata does look like a logical place to house the
> > information so that would solve part of the problem. Do you have any
> > thoughts on whether we change the function signature:
> >
> > From: <Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send +
> > Sync>;
> > To:   <Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) ->
> > Result<ColumnarValue> + Send + Sync>;
> >
> > @Micah It would be nice to have a reserved metadata key so this could be
> > shared but I am not sure of the admin process for the Arrow project to
> > agree something like that. Is there a forum?
> >
> > On Thu, Feb 25, 2021 at 8:58 AM Micah Kornfield <emkornfi...@gmail.com>
> > wrote:
> >
> > > At least in C++ (and the IPC format), a schema can be shared across
> > > many RecordBatches, which might have different sources.
> > >
> > >  It might be useful to define a reserved metadata key (similar to
> > > extension types) so that the data can be interpreted consistently.
> > >
> > > On Wed, Feb 24, 2021 at 11:29 AM Andrew Lamb <al...@influxdata.com> wrote:
> > >
> > > > I wonder if you could add the file_name as metadata on the `Schema` of
> > > > the RecordBatch rather than the RecordBatch itself? Since every
> > > > RecordBatch has a schema, I don't fully understand the need to add
> > > > something additional to the RecordBatch
> > > >
> > > > https://docs.rs/arrow/3.0.0/arrow/datatypes/struct.Schema.html#method.new_with_metadata
> > > >
> > > > On Wed, Feb 24, 2021 at 1:20 AM Mike Seddon <seddo...@gmail.com> wrote:
> > > >
> > > > > Hi,
> > > > >
> > > > > One of Apache Spark's very useful SQL functions is the
> > > > > 'input_file_name' SQL function which provides a simple API for
> > > > > identifying the source of a row of data when sourced from a
> > > > > file-based source like Parquet or CSV. This is particularly useful
> > > > > for identifying which chunk/partition of a Parquet the row came from
> > > > > and is used heavily by the DeltaLake format to determine which files
> > > > > are impacted for MERGE operations.
> > > > >
> > > > > I have built a functional proof-of-concept for DataFusion but it
> > > > > requires modifying the RecordBatch struct to include a 'metadata'
> > > > > struct (RecordBatchMetadata) to carry the source file name attached
> > > > > to each batch.
> > > > >
> > > > > It also requires changing the ScalarFunctionImplementation signature
> > > > > to support exposing the metadata (and therefore all the functions).
> > > > >
> > > > > From: <Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send +
> > > > > Sync>;
> > > > > To:   <Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) ->
> > > > > Result<ColumnarValue> + Send + Sync>;
> > > > >
> > > > > These changes have been made in a personal feature branch and are
> > > > > available for review (still needs cleaning) but conceptually does
> > > > > anyone have a problem with this API change or have a better proposal?
> > > > >
> > > > > Thanks
> > > > > Mike
> > > > >
> > > >
> > >
> >
>
