Hi Fernando.
There are two independent use cases for input_file_name():
1. A lot of data still comes in from CSV and we process to Parquet after
careful data typing rules have been applied. We then materialize the
input_file_name to a column in the Parquet to be able to trace lineage of
the data.
I see. You are storing the file name when reading JSON, CSV, and Parquet
files.
Just out of curiosity, how would you use the file name in spark?
Are you using it for file statistics?
On Thu, Feb 25, 2021 at 9:36 AM Mike Seddon wrote:
> Hi Fernando,
>
> After Andrew's reply I have moved the fil
Hi Fernando,
After Andrew's reply I have moved the filename metadata into the Schema and
actually changed the ScalarFunctionImplementation signature to: Arc Result + Send + Sync>;
I have a functional (WIP) repo already:
https://github.com/seddonm1/arrow/compare/master...seddonm1:input-file
I ne
Hi Mike,
I've been thinking about how you are considering adding metadata to the
RecordBatch.
The struct is currently defined as
pub struct RecordBatch {
> schema: SchemaRef,
> columns: Vec<Arc<dyn Array>>,
> }
Are you suggesting something like this?
pub struct RecordBatch {
> schema: SchemaRef,
> co
Thanks Micah.
It is actually the Rust implementation that is the odd one out. Ideally adding
a metadata KeyValue to the RecordBatch plus your suggested 'reserved' key
would be the best option.
On Thu, Feb 25, 2021 at 3:26 PM Micah Kornfield
wrote:
> Thanks for looking into it. I would guess it is l
Thanks for looking into it. I would guess it is likely possible to "hoist"
metadata from a record batch schema object to the Message but understand if
it isn't something you want to pursue.
On Wed, Feb 24, 2021 at 8:19 PM Mike Seddon wrote:
> Hi Micah,
> Thank you for providing this information. I
Hi Micah,
Thank you for providing this information. I have reviewed the documentation
you provided and have a few conclusions:
1. RecordBatch does not have the capability to attach user defined metadata
(KeyValue attributes):
https://github.com/apache/arrow/blob/master/format/Message.fbs#L83
2. Sc
The process would be to create a PR proposal to update the custom
metadata specification [1] to reserve a new word and describe its use.
Then send a [DISCUSS] email on this list. Once there is consensus we can
formally vote and merge the change.
[1]
https://github.com/apache/arrow/blob/master/
Thanks for both of your comments.
@Andrew Schema.metadata does look like a logical place to house the
information so that would solve part of the problem. Do you have any
thoughts on whether we change the function signature:
From: Result + Send + Sync>;
To:
Result + Send + Sync>;
@Micah It w
At least in C++ (and the IPC format), a schema can be shared across many
RecordBatches, which might have different sources.
It might be useful to define a reserved metadata key (similar to
extension types) so that the data can be interpreted consistently.
On Wed, Feb 24, 2021 at 11:29 AM Andrew L
I wonder if you could add the file_name as metadata on the `Schema` of the
RecordBatch rather than the RecordBatch itself? Since every RecordBatch has
a schema, I don't fully understand the need to add something additional to
the RecordBatch
https://docs.rs/arrow/3.0.0/arrow/datatypes/struct.Schem
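
For readers following the thread, here is a minimal sketch of the Schema-metadata idea discussed above, assuming simplified stand-in types rather than the real arrow crate. The `Schema` and `RecordBatch` structs, the `with_input_file` and `input_file_name` helpers, and the `input_file_name` key are all hypothetical and only illustrate the shape of the approach. Because a schema can be shared across batches from different sources, the sketch gives each batch its own tagged copy of the schema instead of mutating a shared one.

```rust
use std::collections::HashMap;
use std::sync::Arc;

// Hypothetical reserved metadata key; the thread does not settle on a name.
const INPUT_FILE_KEY: &str = "input_file_name";

// Simplified stand-ins for arrow's Schema and RecordBatch (not the real types).
#[derive(Debug, Clone, PartialEq)]
struct Schema {
    fields: Vec<String>,
    metadata: HashMap<String, String>,
}

struct RecordBatch {
    schema: Arc<Schema>,
    columns: Vec<Vec<i64>>,
}

/// Returns a copy of `schema` with the source file recorded in its metadata.
/// The original schema is left untouched so it can still be shared.
fn with_input_file(schema: &Schema, file: &str) -> Arc<Schema> {
    let mut tagged = schema.clone();
    tagged
        .metadata
        .insert(INPUT_FILE_KEY.to_string(), file.to_string());
    Arc::new(tagged)
}

/// Looks up the source file of a batch, if one was recorded.
fn input_file_name(batch: &RecordBatch) -> Option<&str> {
    batch.schema.metadata.get(INPUT_FILE_KEY).map(String::as_str)
}

fn main() {
    let base = Schema {
        fields: vec!["id".to_string()],
        metadata: HashMap::new(),
    };

    // Two batches read from different files share the same fields
    // but carry different file names in their schema metadata.
    let a = RecordBatch {
        schema: with_input_file(&base, "a.csv"),
        columns: vec![vec![1, 2]],
    };
    let b = RecordBatch {
        schema: with_input_file(&base, "b.csv"),
        columns: vec![vec![3]],
    };

    for batch in [&a, &b] {
        println!(
            "fields={:?} rows={} file={:?}",
            batch.schema.fields,
            batch.columns[0].len(),
            input_file_name(batch)
        );
    }
}
```

The trade-off in this sketch is that two batches with identical fields no longer compare equal at the schema level once they are tagged with different files, which is one reason a reserved key with agreed semantics was suggested in the thread.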