Re: [Rust][DataFusion] Supporting input_file_name()

Mike Seddon Wed, 24 Feb 2021 15:47:23 -0800

Thanks for both of your comments.

@Andrew Schema.metadata does look like a logical place to house the
information, so that would solve part of the problem. Do you have any
thoughts on whether we should change the function signature as follows?


From: Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;
To:   Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>;
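
To make the difference concrete, here is a minimal sketch of the two
signatures side by side. It deliberately uses stand-in types rather than the
real arrow/DataFusion ones: `ColumnarValue`, `RecordBatchMetadata`, the
`input_file_name` field, the `String` error type, and the helper functions
`make_input_file_name` and `adapt` are all assumptions for illustration, not
code from the feature branch.

```rust
use std::sync::Arc;

// Stand-in for DataFusion's ColumnarValue (assumption: reduced to one variant).
#[derive(Debug, Clone, PartialEq)]
enum ColumnarValue {
    Scalar(String),
}

// Stand-in for the proposed RecordBatchMetadata (assumption: only a file name).
#[derive(Debug, Clone, Default)]
struct RecordBatchMetadata {
    input_file_name: Option<String>,
}

// Simplified error type for the sketch.
type Result<T> = std::result::Result<T, String>;

// Current shape of the scalar function signature.
type ScalarFn = Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;

// Proposed shape: the batch metadata is passed as a second argument.
type ScalarFnWithMeta =
    Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>;

// Hypothetical input_file_name(): ignores its arguments and reads the metadata.
fn make_input_file_name() -> ScalarFnWithMeta {
    Arc::new(|_args: &[ColumnarValue], meta: RecordBatchMetadata| {
        meta.input_file_name
            .map(ColumnarValue::Scalar)
            .ok_or_else(|| "no input file name attached to this batch".to_string())
    })
}

// Hypothetical adapter: wraps an existing function so it fits the new signature.
fn adapt(f: ScalarFn) -> ScalarFnWithMeta {
    Arc::new(move |args: &[ColumnarValue], _meta: RecordBatchMetadata| f(args))
}
```

An adapter along the lines of `adapt` is one way the existing functions could
keep their bodies unchanged, though every registration site would still need
to be touched.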

@Micah It would be nice to have a reserved metadata key so this could be
shared, but I am not sure of the admin process for the Arrow project to
agree on something like that. Is there a forum?
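
As a rough sketch of the Schema-metadata approach with a shared key: the
`Schema` below is a stand-in reduced to a string-to-string metadata map, and
the key name `ARROW:input_file_name` is purely hypothetical (no such key has
been agreed by the Arrow project). The helper methods are likewise
assumptions for illustration.

```rust
use std::collections::HashMap;

// Hypothetical reserved key; NOT an agreed Arrow metadata key.
const INPUT_FILE_NAME_KEY: &str = "ARROW:input_file_name";

// Stand-in for arrow's Schema, reduced to its metadata map.
#[derive(Debug, Clone, Default)]
struct Schema {
    metadata: HashMap<String, String>,
}

impl Schema {
    // Returns a schema with the source file name recorded under the key.
    fn with_input_file_name(mut self, file_name: &str) -> Self {
        self.metadata
            .insert(INPUT_FILE_NAME_KEY.to_string(), file_name.to_string());
        self
    }

    // Looks the file name up again; None if the key was never set.
    fn input_file_name(&self) -> Option<&str> {
        self.metadata.get(INPUT_FILE_NAME_KEY).map(String::as_str)
    }
}
```

Because a schema can be shared across batches from different sources, this
only works if each source file gets its own Schema instance carrying its own
file name.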

On Thu, Feb 25, 2021 at 8:58 AM Micah Kornfield <[email protected]>
wrote:

> At least in C++ (and the IPC format), a schema can be shared across many
> RecordBatches, which might have different sources.
>
> It might be useful to define a reserved metadata key (similar to
> extension types) so that the data can be interpreted consistently.
>
> On Wed, Feb 24, 2021 at 11:29 AM Andrew Lamb <[email protected]> wrote:
>
> > I wonder if you could add the file_name as metadata on the `Schema` of the
> > RecordBatch rather than the RecordBatch itself? Since every RecordBatch has
> > a schema, I don't fully understand the need to add something additional to
> > the RecordBatch.
> >
> > https://docs.rs/arrow/3.0.0/arrow/datatypes/struct.Schema.html#method.new_with_metadata
> >
> > On Wed, Feb 24, 2021 at 1:20 AM Mike Seddon <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > One of Apache Spark's very useful SQL functions is 'input_file_name',
> > > which provides a simple API for identifying the source of a row of data
> > > when it is read from a file-based source like Parquet or CSV. This is
> > > particularly useful for identifying which chunk/partition of a Parquet
> > > dataset the row came from, and it is used heavily by the DeltaLake format
> > > to determine which files are impacted by MERGE operations.
> > >
> > > I have built a functional proof-of-concept for DataFusion, but it requires
> > > modifying the RecordBatch struct to include a 'metadata' struct
> > > (RecordBatchMetadata) to carry the source file name attached to each
> > > batch.
> > >
> > > It also requires changing the ScalarFunctionImplementation signature to
> > > support exposing the metadata (and therefore all the functions).
> > >
> > > From: Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;
> > > To:   Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>;
> > >
> > > These changes have been made in a personal feature branch and are available
> > > for review (the branch still needs cleaning up). Conceptually, does anyone
> > > have a problem with this API change, or a better proposal?
> > >
> > > Thanks
> > > Mike
> > >
> >
>
