Hello Iceberg devs,

I'm Weston. I've been working on the Arrow project lately and I am
reviewing how we handle the Parquet field_id (and also adding support
for specifying a field_id at write time)[1][2].  This has brought up
two questions.

 1. The original PR adding field_id support[3][4] not only allowed the
field_id to pass through from Parquet to Arrow but also generated ids
(in a depth-first fashion) for fields that did not have a field_id.
In retrospect, it seems this auto-generation of field_ids was probably
not a good idea.  Would it have any impact on Iceberg if we removed
it?  Just to be clear, we will still have support for reading (and
now writing) the Parquet field_id.  I am only talking about removing
the auto-generation of missing values.
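
For context, here is a rough pyarrow sketch of what I mean (the file
names and id values are placeholders; it assumes field_ids round-trip
through field metadata under the PARQUET:field_id key and that the
write-time support from [2] is in place):

  import pyarrow as pa
  import pyarrow.parquet as pq

  # Write a file where the application supplies explicit field ids
  # via field-level metadata (hypothetical file names and id values).
  schema = pa.schema([
      pa.field("id", pa.int64(), metadata={"PARQUET:field_id": "1"}),
      pa.field("name", pa.string(), metadata={"PARQUET:field_id": "2"}),
  ])
  table = pa.table({"id": [1, 2], "name": ["a", "b"]}, schema=schema)
  pq.write_table(table, "with_ids.parquet")

  # Reading it back should surface the same ids in the field metadata.
  print(pq.read_table("with_ids.parquet").schema.field("id").metadata)

  # Now a file written with no field ids at all.  The question is
  # whether reading it back should auto-generate depth-first ids (the
  # behavior from [3][4]) or simply leave the metadata absent.
  pq.write_table(pa.table({"x": [1, 2]}), "without_ids.parquet")
  print(pq.read_table("without_ids.parquet").schema.field("x").metadata)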

 2. For the second question, I'm looking for the Iceberg community's
opinion as users of Arrow.  Arrow is enabling more support for
computation on data (e.g., relational operators) and I've been
wondering how those transformations should affect metadata (like the
field_id).  Some examples (a rough pyarrow sketch follows the list):

 * Selecting a subset of columns (it seems the field_id/metadata
should remain unchanged)
 * Filtering a table by rows (it seems the field_id/metadata should
remain unchanged)
 * Filling in null values with a placeholder value (the data has
changed, so ???)
 * Casting a field to a different data type (the meaning of the data
has changed, so ???)
 * Combining two fields into a third field (it seems the
field_id/metadata should be dropped from the third field, but
presumably it could also be the combined metadata of the two source
fields)
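
To make these concrete, here is roughly how the scenarios look in
pyarrow (the field_id values are placeholders, and the metadata
propagation shown is what I would expect rather than a guarantee):

  import pyarrow as pa
  import pyarrow.compute as pc

  schema = pa.schema([
      pa.field("a", pa.int64(), metadata={"PARQUET:field_id": "1"}),
      pa.field("b", pa.int64(), metadata={"PARQUET:field_id": "2"}),
  ])
  table = pa.table({"a": [1, 2, 3], "b": [10, 20, 30]}, schema=schema)

  # Selecting columns: the surviving field keeps its metadata.
  print(table.select(["a"]).schema.field("a").metadata)

  # Filtering rows: only the rows change, so the schema (and its
  # field metadata) stays put.
  print(table.filter(pc.greater(table["a"], 1)).schema.field("a").metadata)

  # Casting: the values are reinterpreted; should the field_id follow?
  casted = table.set_column(0, "a", pc.cast(table["a"], pa.float64()))
  print(casted.schema.field("a").metadata)

  # Combining two fields: "c" is a brand-new field, so it carries no
  # metadata unless something explicitly attaches it.
  combined = table.append_column("c", pc.add(table["a"], table["b"]))
  print(combined.schema.field("c").metadata)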

Thanks for your time,

-Weston Pace

[1] https://issues.apache.org/jira/browse/PARQUET-1798
[2] https://github.com/apache/arrow/pull/10289
[3] https://issues.apache.org/jira/browse/ARROW-7080
[4] https://github.com/apache/arrow/pull/6408
