Re: Usage of parquet field_id

Weston Pace Tue, 18 May 2021 13:49:23 -0700

Ok, this is matching my understanding of how field_id is used as well.
I believe #1 will not be an issue because I think Iceberg always sets
the field_id property when writing data?  If that is the case then
Iceberg would never have noticed the old behavior.  In other words,
Iceberg never relied on Arrow to set the field_id.
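
To make the point about #1 concrete, here is a minimal sketch of what
"fill in missing ids depth first" could look like.  This is not Arrow's
actual implementation.  The `Field` class, `fill_missing_ids`, and the
policy of skipping ids already in use are all assumptions made up for
illustration.

```python
from dataclasses import dataclass, field
from itertools import count
from typing import List, Optional


@dataclass
class Field:
    # Hypothetical stand-in for a schema node; not an Arrow or Parquet type.
    name: str
    field_id: Optional[int] = None
    children: List["Field"] = field(default_factory=list)


def collect_ids(fields):
    """Return the set of field ids already present anywhere in the tree."""
    used = set()
    for f in fields:
        if f.field_id is not None:
            used.add(f.field_id)
        used |= collect_ids(f.children)
    return used


def fill_missing_ids(fields):
    """Assign ids, depth first, to fields that lack one (assumed policy)."""
    used = collect_ids(fields)
    fresh = (i for i in count(1) if i not in used)

    def visit(f):
        if f.field_id is None:
            f.field_id = next(fresh)
        for child in f.children:
            visit(child)

    for f in fields:
        visit(f)
    return fields


# A writer that sets every id (as I believe Iceberg does) is unaffected.
complete = fill_missing_ids([Field("a", 7), Field("b", 9, [Field("x", 8)])])

# A schema with gaps gets ids invented for it.
partial = fill_missing_ids(
    [Field("a", 1), Field("b", None, [Field("x"), Field("y", 3)]), Field("c")]
)
```

If every field already carries an id, the function changes nothing, which
is why removing the auto-generation should be invisible to such a writer.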


For #2 I think your example is helpful.  The `field_id` is sort of a
file-specific concept.  Once you are at the dataset layer the Iceberg
schema takes precedence and the field_id is no longer necessary.
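
As a rough sketch of how I picture this (plain dictionaries, not the
Iceberg or Arrow API; `resolve_columns` and the example schemas are
hypothetical), the field_id is only needed to line a file's columns up
with the table schema.  After that, everything downstream can use the
table's own names.

```python
def resolve_columns(file_schema, table_schema):
    """Map each table column name to the matching column name in one file.

    file_schema:  {column name in this file: field_id}
    table_schema: {field_id: current column name in the table}
    Columns the file does not contain map to None.
    """
    file_name_by_id = {fid: name for name, fid in file_schema.items()}
    return {
        table_name: file_name_by_id.get(fid)
        for fid, table_name in table_schema.items()
    }


# An older file written before "name" was renamed and "email" was added.
old_file = {"id": 1, "name": 2}
table = {1: "id", 2: "full_name", 3: "email"}

mapping = resolve_columns(old_file, table)
```

Here `mapping` tells the reader to load the file's "name" column as
"full_name" and to fill "email" with nulls.  Once that is done the ids
have served their purpose.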

Also, thinking about it more generally, metadata is really part of the
schema / control channel.  The compute operations in Arrow are more
involved with the data channel.  "Combining metadata" might be a
concern of tools that "combine schema" (e.g. dataset evolution) but
isn't a concern of tools that combine data (e.g. Arrow compute).  So
in that sense the compute operations probably don't need to worry much
about preserving metadata.
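
A small sketch of the split I have in mind, under the assumption that
columns are plain lists and field metadata is a plain dictionary.  Both
function names and the "keep only what both sides agree on" rule are
invented for illustration, not something Arrow does today.

```python
def add_columns(left, right):
    """Data channel: works on values only and never sees field metadata."""
    return [
        None if a is None or b is None else a + b
        for a, b in zip(left, right)
    ]


def merge_field_metadata(left, right):
    """Schema channel: keep only the entries both fields agree on
    (an assumed policy; a schema tool could choose differently)."""
    return {k: v for k, v in left.items() if k in right and right[k] == v}


total = add_columns([1, None, 3], [10, 20, 30])

merged = merge_field_metadata(
    {"field_id": "1", "unit": "ms"},
    {"field_id": "2", "unit": "ms"},
)
```

The compute function has no metadata in its signature at all.  Deciding
what the combined field's metadata should be is left to whatever layer
owns the schema.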

It has been helpful to hear how this is used.  I needed a concrete
example to bounce the idea around in my head.

Thanks,

-Weston

On Tue, May 18, 2021 at 5:48 AM Daniel Weeks <dwe...@apache.org> wrote:
>
> Hey Weston,
>
> From Iceberg's perspective, the field_id is necessary to track the
> evolution of the schema over time.  It's best to think of the problem from a 
> dataset perspective as opposed to a file perspective.
>
> Iceberg maintains the mapping of the schema with respect to the field ids 
> because as the files in the datasets change, the field names may change, but 
> field id is intended to be persistent and referenceable regardless of name or 
> position within the file.
>
> For #1 above, I'm not sure I understand the issue of having the field ids 
> auto-generated.  If you're not using the field ids to reference the columns, 
> does it matter if they are present or not?
>
> For #2, I would speculate that the field id is less relevant after the 
> initial projection and filtering (it really depends on how the engine wants 
> to track fields at that point, so I would suspect that maybe field id 
> wouldn't be ideal especially after various transforms or aggregations are 
> applied).  However, it does matter when persisting the data as the field ids 
> need to be resolved to the target dataset.  If it's a new dataset, new field 
> ids can be generated using the original approach.  However, if the data is 
> being appended to an existing dataset, the field ids need to be resolved 
> against that target dataset and rewritten before persisting to parquet so 
> they align with the Iceberg schema (in SQL this is done positionally).
>
> Let me know if any of that doesn't make sense.  I'm still a little unclear on 
> the issue in #1, so it would be helpful if you could clarify that for me.
>
> Thanks,
> Dan
>
> On Mon, May 17, 2021 at 8:50 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> Hello Iceberg devs,
>>
>> I'm Weston.  I've been working on the Arrow project lately and I am
>> reviewing how we handle the parquet field_id (and also adding support
>> for specifying a field_id at write time)[1][2].  This
>> has brought up two questions.
>>
>>  1. The original PR adding field_id support[3][4] not only allowed the
>> field_id to pass through from parquet to arrow but also generated ids
>> (in a depth first fashion) for fields that did not have a field_id.
>> In retrospect, it seems this auto-generation of field_id was probably
>> not a good idea.  Would it have any impact on Iceberg if we removed
>> it?  Just to be clear, we will still have support for reading (and
>> now writing) the parquet field_id.  I am only talking about removing
>> the auto-generation of missing values.
>>
>>  2. For the second question I'm looking for the Iceberg community's
>> opinion as users of Arrow.  Arrow is enabling more support for
>> computation on data (e.g. relational operators) and I've been
>> wondering how those transformations should affect metadata (like the
>> field_id).  For some examples:
>>
>>  * Filtering a table by column (it seems the field_id/metadata should
>> remain unchanged)
>>  * Filtering a table by rows (it seems the field_id/metadata should
>> remain unchanged)
>>  * Filling in null values with a placeholder value (the data is changed so 
>> ???)
>>  * Casting a field to a different data type (the meaning of the data
>> has changed so ???)
>>  * Combining two fields into a third field (it seems the
>> field_id/metadata should be erased in the third field but presumably
>> it could also be the joined metadata from the two origin fields)
>>
>> Thanks for your time,
>>
>> -Weston Pace
>>
>> [1] https://issues.apache.org/jira/browse/PARQUET-1798
>> [2] https://github.com/apache/arrow/pull/10289
>> [3] https://issues.apache.org/jira/browse/ARROW-7080
>> [4] https://github.com/apache/arrow/pull/6408
