Thanks Micah.

It is actually the Rust implementation that is the odd one out. Adding a
metadata KeyValue to the RecordBatch, plus your suggested 'reserved' key,
would be the best option.
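
To make the 'reserved' part concrete: the key name below is hypothetical
(by analogy with the existing "ARROW:extension:name" convention used for
extension types), since nothing is reserved yet. The proposal would amount
to a well-known entry in the custom metadata:

    use std::collections::HashMap;

    // Hypothetical reserved key, by analogy with "ARROW:extension:name".
    // Nothing is reserved yet; this is only what the proposal would mean.
    const INPUT_FILE_NAME_KEY: &str = "ARROW:input_file_name";

    fn source_metadata(file_name: &str) -> HashMap<String, String> {
        let mut metadata = HashMap::new();
        metadata.insert(INPUT_FILE_NAME_KEY.to_string(), file_name.to_string());
        metadata
    }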

On Thu, Feb 25, 2021 at 3:26 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> Thanks for looking into it. I would guess it is likely possible to "hoist"
> metadata from a record batch schema object to the Message, but I understand
> if it isn't something you want to pursue.
>
> On Wed, Feb 24, 2021 at 8:19 PM Mike Seddon <seddo...@gmail.com> wrote:
>
>> Hi Micah,
>> Thank you for providing this information. I have reviewed the
>> documentation you provided and have a few conclusions:
>>
>> 1. RecordBatch does not have the capability to attach user-defined
>> metadata (KeyValue attributes):
>> https://github.com/apache/arrow/blob/master/format/Message.fbs#L83
>> 2. Schema does have this capability, but it would not work for passing
>> per-batch input files: the design indicates that the Schema message is
>> sent once, and then a series of interleaved DictionaryBatch and
>> RecordBatch messages must conform to that Schema:
>> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst#ipc-streaming-format
>>
>> In the Rust implementation each RecordBatch embeds its Schema, so each
>> schema can have different metadata (like the filename in this case). I
>> think this will have to be implemented in Rust as a memory-only attribute
>> which does not get persisted unless more significant changes to the
>> protocol are made.
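>>
>> As a rough sketch of that memory-only approach with the current arrow
>> crate (the metadata key and file name are made up for illustration):
>>
>>     use std::collections::HashMap;
>>     use std::sync::Arc;
>>
>>     use arrow::array::{ArrayRef, Int32Array};
>>     use arrow::datatypes::{DataType, Field, Schema};
>>     use arrow::record_batch::RecordBatch;
>>
>>     // Attach the source file name to a batch via its own Schema, so the
>>     // metadata can differ per batch; it is not persisted over IPC, which
>>     // writes the Schema message only once.
>>     fn batch_with_source(file_name: &str) -> arrow::error::Result<RecordBatch> {
>>         let mut metadata = HashMap::new();
>>         metadata.insert("input_file_name".to_string(), file_name.to_string());
>>
>>         let schema = Schema::new_with_metadata(
>>             vec![Field::new("a", DataType::Int32, false)],
>>             metadata,
>>         );
>>         let column: ArrayRef = Arc::new(Int32Array::from(vec![1, 2, 3]));
>>         RecordBatch::try_new(Arc::new(schema), vec![column])
>>     }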
>>
>> Thanks
>> Mike
>>
>> On Thu, Feb 25, 2021 at 11:14 AM Micah Kornfield <emkornfi...@gmail.com>
>> wrote:
>>
>>> The process would be to create a PR proposal to update the custom
>>> metadata specification [1] to reserve a new key and describe its use.
>>> Then send a [DISCUSS] email on this list. Once there is consensus, we
>>> can formally vote and merge the change.
>>>
>>> [1]
>>> https://github.com/apache/arrow/blob/master/docs/source/format/Columnar.rst
>>>
>>> On Wed, Feb 24, 2021 at 3:47 PM Mike Seddon <seddo...@gmail.com> wrote:
>>>
>>> > Thanks for both of your comments.
>>> >
>>> > @Andrew Schema.metadata does look like a logical place to house the
>>> > information, so that would solve part of the problem. Do you have any
>>> > thoughts on whether we should change the function signature:
>>> >
>>> > From: Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue>
>>> >       + Send + Sync>;
>>> > To:   Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata)
>>> >       -> Result<ColumnarValue> + Send + Sync>;
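>>> >
>>> > For concreteness, a rough sketch of the proposed alias
>>> > (RecordBatchMetadata is hypothetical, and the datafusion module paths
>>> > are from memory, so they may differ by version):
>>> >
>>> >     use std::sync::Arc;
>>> >
>>> >     use datafusion::error::Result;
>>> >     use datafusion::physical_plan::ColumnarValue;
>>> >
>>> >     /// Hypothetical per-batch metadata passed alongside the arguments.
>>> >     #[derive(Debug, Clone, Default)]
>>> >     pub struct RecordBatchMetadata {
>>> >         /// Source file the batch was read from, if any.
>>> >         pub input_file_name: Option<String>,
>>> >     }
>>> >
>>> >     /// Proposed signature: scalar functions also receive the metadata.
>>> >     pub type ScalarFunctionImplementation = Arc<
>>> >         dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue>
>>> >             + Send
>>> >             + Sync,
>>> >     >;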
>>> >
>>> > @Micah It would be nice to have a reserved metadata key so this could
>>> > be shared, but I am not sure of the Arrow project's admin process for
>>> > agreeing on something like that. Is there a forum?
>>> >
>>> > On Thu, Feb 25, 2021 at 8:58 AM Micah Kornfield <emkornfi...@gmail.com>
>>> > wrote:
>>> >
>>> >
>>> > > At least in C++ (and the IPC format), a schema can be shared across
>>> > > many RecordBatches, which might have different sources.
>>> > >
>>> > >  It might be useful to define a reserved metadata key (similar to
>>> > > extension types) so that the data can be interpreted consistently.
>>> > >
>>> > > On Wed, Feb 24, 2021 at 11:29 AM Andrew Lamb <al...@influxdata.com>
>>> > > wrote:
>>> > >
>>> > > > I wonder if you could add the file_name as metadata on the
>>> > > > `Schema` of the RecordBatch rather than on the RecordBatch itself?
>>> > > > Since every RecordBatch has a schema, I don't fully understand the
>>> > > > need to add something additional to the RecordBatch.
>>> > > >
>>> > > > https://docs.rs/arrow/3.0.0/arrow/datatypes/struct.Schema.html#method.new_with_metadata
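>>> > > >
>>> > > > Since every batch carries its schema, a function could read it back
>>> > > > with something like this (the "file_name" key is invented for
>>> > > > illustration):
>>> > > >
>>> > > >     use arrow::record_batch::RecordBatch;
>>> > > >
>>> > > >     /// Look up the source file name in the batch's schema metadata.
>>> > > >     fn file_name_of(batch: &RecordBatch) -> Option<String> {
>>> > > >         batch.schema().metadata().get("file_name").cloned()
>>> > > >     }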
>>> > > >
>>> > > > On Wed, Feb 24, 2021 at 1:20 AM Mike Seddon <seddo...@gmail.com>
>>> > > > wrote:
>>> > > >
>>> > > > > Hi,
>>> > > > >
>>> > > > > One of Apache Spark's very useful SQL functions is
>>> > > > > 'input_file_name', which provides a simple API for identifying
>>> > > > > the source of a row of data when it comes from a file-based
>>> > > > > source like Parquet or CSV. This is particularly useful for
>>> > > > > identifying which chunk/partition of a Parquet file a row came
>>> > > > > from, and it is used heavily by the DeltaLake format to determine
>>> > > > > which files are impacted by MERGE operations.
>>> > > > >
>>> > > > > I have built a functional proof-of-concept for DataFusion, but
>>> > > > > it requires modifying the RecordBatch struct to include a
>>> > > > > 'metadata' struct (RecordBatchMetadata) to carry the source file
>>> > > > > name attached to each batch.
>>> > > > >
>>> > > > > It also requires changing the ScalarFunctionImplementation
>>> > > > > signature (and therefore all the functions) to support exposing
>>> > > > > the metadata:
>>> > > > >
>>> > > > > From: Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue>
>>> > > > >       + Send + Sync>;
>>> > > > > To:   Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata)
>>> > > > >       -> Result<ColumnarValue> + Send + Sync>;
>>> > > > >
>>> > > > > These changes have been made in a personal feature branch and are
>>> > > > > available for review (it still needs cleaning), but conceptually
>>> > > > > does anyone have a problem with this API change, or does anyone
>>> > > > > have a better proposal?
>>> > > > >
>>> > > > > Thanks
>>> > > > > Mike
>>> > > > >
>>> > > >
>>> > >
>>> >
>>>
>>
