I wonder if you could add the file_name as metadata on the `Schema` of the
RecordBatch rather than on the RecordBatch itself? Since every RecordBatch
already has a schema, I don't fully understand the need to add a separate
structure to the RecordBatch.

https://docs.rs/arrow/3.0.0/arrow/datatypes/struct.Schema.html#method.new_with_metadata
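Roughly what I have in mind (a minimal stand-alone sketch: the `Schema` here is a stripped-down stand-in that models only the metadata handling, since per the linked docs `Schema::new_with_metadata` in arrow 3.0 takes a `HashMap<String, String>`; the `"input_file_name"` key and the file path are illustrative assumptions, not anything in arrow):

```rust
use std::collections::HashMap;

// Stand-in for arrow's Schema, modeling only the metadata map that
// Schema::new_with_metadata accepts in arrow 3.0.
struct Schema {
    metadata: HashMap<String, String>,
}

impl Schema {
    fn new_with_metadata(metadata: HashMap<String, String>) -> Self {
        Schema { metadata }
    }

    fn metadata(&self) -> &HashMap<String, String> {
        &self.metadata
    }
}

fn main() {
    // When reading a file-based source, tag each batch's schema with the
    // source path. The key name "input_file_name" is a hypothetical choice.
    let mut metadata = HashMap::new();
    metadata.insert(
        "input_file_name".to_string(),
        "/data/part-00000.parquet".to_string(), // illustrative path
    );
    let schema = Schema::new_with_metadata(metadata);

    // A scalar function could then read the file name back off the schema
    // it already receives, with no change to the function signature.
    let file = schema.metadata().get("input_file_name");
    println!("{:?}", file); // Some("/data/part-00000.parquet")
}
```

The appeal of this route is that the file name travels with the schema every function already has access to, so `ScalarFunctionImplementation` would not need a new parameter.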

On Wed, Feb 24, 2021 at 1:20 AM Mike Seddon <seddo...@gmail.com> wrote:

> Hi,
>
> One of Apache Spark's very useful SQL functions is 'input_file_name',
> which provides a simple API for identifying the source of a row of data
> when it comes from a file-based source like Parquet or CSV. This is
> particularly useful for identifying which chunk/partition of a Parquet
> file the row came from, and is used heavily by the DeltaLake format to
> determine which files are impacted by MERGE operations.
>
> I have built a functional proof-of-concept for DataFusion, but it requires
> modifying the RecordBatch struct to include a 'metadata' struct
> (RecordBatchMetadata) that carries the source file name for each batch.
>
> It also requires changing the ScalarFunctionImplementation signature to
> support exposing the metadata (and therefore all the functions).
>
> From: Arc<dyn Fn(&[ColumnarValue]) -> Result<ColumnarValue> + Send + Sync>;
> To:   Arc<dyn Fn(&[ColumnarValue], RecordBatchMetadata) -> Result<ColumnarValue> + Send + Sync>;
>
> These changes have been made in a personal feature branch and are available
> for review (they still need cleaning up). Conceptually, does anyone have a
> problem with this API change, or a better proposal?
>
> Thanks
> Mike
>
