Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-25 Thread Mike Seddon
Hi Fernando. There are two independent use cases for input_file_name(): 1. A lot of data still arrives as CSV, and we convert it to Parquet after applying careful data typing rules. We then materialize the input_file_name into a column in the Parquet file so that we can trace the lineage of the data.

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-25 Thread Fernando Herrera
I see. You are storing the file name when reading a JSON, CSV, or Parquet file. Just out of curiosity, how would you use the file name in Spark? Are you using it for file statistics? On Thu, Feb 25, 2021 at 9:36 AM Mike Seddon wrote: > Hi Fernando, > > After Andrew's reply I have moved the fil

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-25 Thread Mike Seddon
Hi Fernando, After Andrew's reply I have moved the filename metadata into the Schema and actually changed the ScalarFunctionImplementation signature to: Arc Result + Send + Sync>; I have a functional (WIP) repo already: https://github.com/seddonm1/arrow/compare/master...seddonm1:input-file I ne
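The generic parameters of the signature quoted above were lost in the archive, so the exact change is not recoverable here. As a rough illustration only, the idea of passing schema metadata to a scalar function could be sketched as follows. `Schema` and `ColumnarValue` are simplified stand-ins rather than the real arrow/DataFusion types, and the extra `&Schema` argument and the `"filename"` key are assumptions, not the signature from the WIP branch.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Stand-in for arrow's Schema: only the metadata map matters for this sketch.
pub struct Schema {
    pub metadata: HashMap<String, String>,
}

// Stand-in for DataFusion's ColumnarValue, reduced to one nullable string scalar.
#[derive(Debug, PartialEq)]
pub enum ColumnarValue {
    Utf8(Option<String>),
}

// Hypothetical signature: the function receives the schema alongside its arguments.
pub type ScalarFunctionImplementation =
    Arc<dyn Fn(&[ColumnarValue], &Schema) -> Result<ColumnarValue, String> + Send + Sync>;

// Hypothetical metadata key under which a reader would store the source file name.
pub const FILE_NAME_KEY: &str = "filename";

// input_file_name() ignores its arguments and reads the file name from the metadata.
// It yields a null scalar when the key is absent.
pub fn input_file_name() -> ScalarFunctionImplementation {
    Arc::new(
        |_args: &[ColumnarValue], schema: &Schema| -> Result<ColumnarValue, String> {
            Ok(ColumnarValue::Utf8(schema.metadata.get(FILE_NAME_KEY).cloned()))
        },
    )
}
```

With this shape, a reader that sets the key when it opens a file makes the name available to the function without touching the column data.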

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-25 Thread Fernando Herrera
Hi Mike, I've been thinking about how you are considering adding metadata to the RecordBatch. The struct is currently defined as pub struct RecordBatch { > schema: SchemaRef, > columns: Vec<Arc<Array>>, > } Are you suggesting something like this? pub struct RecordBatch { > schema: SchemaRef, > co

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-24 Thread Mike Seddon
Thanks Micah. It is actually the Rust implementation that is the odd one out. Ideally, adding a metadata KeyValue to the RecordBatch plus your suggested 'reserved' key would be the best option. On Thu, Feb 25, 2021 at 3:26 PM Micah Kornfield wrote: > Thanks for looking into it. I would guess it is l

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-24 Thread Micah Kornfield
Thanks for looking into it. I would guess it is likely possible to "hoist" metadata from a record batch schema object to the Message, but I understand if it isn't something you want to pursue. On Wed, Feb 24, 2021 at 8:19 PM Mike Seddon wrote: > Hi Micah, > Thank you for providing this information. I

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-24 Thread Mike Seddon
Hi Micah, Thank you for providing this information. I have reviewed the documentation you provided and have a few conclusions: 1. RecordBatch does not have the capability to attach user-defined metadata (KeyValue attributes): https://github.com/apache/arrow/blob/master/format/Message.fbs#L83 2. Sc

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-24 Thread Micah Kornfield
The process would be to create a PR proposing an update to the custom metadata specification [1] that reserves a new word and describes its use. Then send a [DISCUSS] email on this list. Once there is consensus we can formally vote and merge the change. [1] https://github.com/apache/arrow/blob/master/

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-24 Thread Mike Seddon
Thanks for both of your comments. @Andrew Schema.metadata does look like a logical place to house the information so that would solve part of the problem. Do you have any thoughts on whether we change the function signature: From: Result + Send + Sync>; To: Result + Send + Sync>; @Micah It w

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-24 Thread Micah Kornfield
At least in C++ (and the IPC format), a schema can be shared across many RecordBatches, which might have different sources. It might be useful to define a reserved metadata key (similar to extension types) so that the data can be interpreted consistently. On Wed, Feb 24, 2021 at 11:29 AM Andrew L
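To make the concern above concrete, here is a minimal sketch of why a shared schema cannot carry a per-file name, together with one possible workaround: deriving a per-source copy of the schema that sets a reserved key. The `Schema` struct is a simplified stand-in for the Arrow type, and the key name `ARROW:source:file_name` and the helper `schema_for_source` are hypothetical; no such reserved key had been agreed on in this thread.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Simplified stand-in for an Arrow schema: field names plus custom metadata.
#[derive(Clone, Debug, PartialEq)]
pub struct Schema {
    pub fields: Vec<String>,
    pub metadata: HashMap<String, String>,
}

// Hypothetical reserved key, modelled on the "ARROW:extension:name" naming style.
pub const SOURCE_FILE_KEY: &str = "ARROW:source:file_name";

// Copy the shared schema and record the source file under the reserved key.
// The shared schema itself is left untouched, so other batches are unaffected.
pub fn schema_for_source(shared: &Arc<Schema>, file_name: &str) -> Arc<Schema> {
    let mut schema = Schema::clone(shared);
    schema
        .metadata
        .insert(SOURCE_FILE_KEY.to_string(), file_name.to_string());
    Arc::new(schema)
}
```

The trade-off is that batches from different files no longer point at the same schema object, which is the kind of cross-implementation inconsistency a reserved key specification would need to address.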

Re: [Rust][DataFusion] Supporting input_file_name()

2021-02-24 Thread Andrew Lamb
I wonder if you could add the file_name as metadata on the `Schema` of the RecordBatch rather than on the RecordBatch itself? Since every RecordBatch has a schema, I don't fully understand the need to add something additional to the RecordBatch. https://docs.rs/arrow/3.0.0/arrow/datatypes/struct.Schem