Re: Usage of parquet field_id

Weston Pace Thu, 20 May 2021 07:23:44 -0700

> #1 is a problem and we should remove the auto-generation.

Sounds like we are aligned.


> I hope that helps.

Thanks for the extra details.  I've learned a lot and it helps to know
how this is used.

On Tue, May 18, 2021 at 2:20 PM Ryan Blue <b...@apache.org> wrote:
>
> Hi Weston,
>
> #1 is a problem and we should remove the auto-generation. The issue is that 
> auto-generating an ID can result in a collision between Iceberg's field IDs 
> and the generated IDs. Since Iceberg uses the ID to identify a field, that 
> would result in unrelated data being mistaken for a column's data.
>
> Your description above for #2 is a bit confusing for me. Field IDs are used 
> to track fields across renames and other schema changes. Those schema changes 
> don't happen in a single file. A file is written with some schema (which 
> includes IDs) and later field resolution happens based on ID. I might have a 
> table with fields `1: a int, 2: b string` that is later evolved to `1: x 
> long, 3: b string`. Any given data file is written with only one version of 
> the schema. From the IDs, you can see that field 1 was renamed and promoted 
> to long, field 2 was deleted, and field 3 was added with field 2's original 
> name.
>
> This ID-based approach is an alternative to name-based resolution (like Avro 
> uses) or position-based resolution (like CSV uses). Both of those resolution 
> methods are flawed and result in correctness issues:
> 1. Name-based resolution can't drop a column and add a new one with the same 
> name
> 2. Position-based resolution can't drop a column in the middle of the schema
>
> Only ID-based resolution gives you the expected SQL behavior for table 
> evolution (ADD/DROP/RENAME COLUMN).
>
> For your original questions:
>
> * Filtering a table is a matter of selecting columns by ID and running 
> filters by ID. In Iceberg. we bind the current names in a SQL table to the 
> field IDs to do this.
> * Filling in null values is done by identifying that a column ID is missing 
> in a data file. Null values are used in place.
> * Casting or promoting data is done by strict rules in Iceberg. This is 
> affected by ID because we know that a field is the same across files, like in 
> my example above.
> * For combining fields, it sounds like you're thinking about operations on 
> the data and when to carry IDs through an operation. I wouldn't recommend 
> ever carrying IDs through. In Spark, we use the current schema's names to 
> produce rows. SQL always uses the current names. And when we write back out 
> to a table, we use SQL semantics, which are to align by position.
>
> I hope that helps. If it's not clear, I'm happy to jump on a call to talk 
> through it with you.
>
> Ryan
>
> On Tue, May 18, 2021 at 1:48 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> Ok, this is matching my understanding of how field_id is used as well.
>> I believe #1 will not be an issue because I think Iceberg always sets
>> the field_id property when writing data?  If that is the case then
>> Iceberg would never have noticed the old behavior.  In other words,
>> Iceberg never relied on Arrow to set the field_id.
>>
>> For #2 I think your example is helpful.  The `field_id` is sort of a
>> file-specific concept.  Once you are at the dataset layer the Iceberg
>> schema takes precedence and the field_id is no longer necessary.
>>
>> Also, thinking about it more generally, metadata is really part of the
>> schema / control channel.  The compute operations in Arrow are more
>> involved with the data channel.  "Combining metadata" might be a
>> concern of tools that "combine schema" (e.g. dataset evolution) but
>> isn't a concern of tools that combine data (e.g. Arrow compute).  So
>> in that sense the compute operations probably don't need to worry much
>> about preserving schema.
>>
>> This has been helpful to hear how this is used.  I needed a concrete
>> example to bounce the idea around in my head with.
>>
>> Thanks,
>>
>> -Weston
>>
>> On Tue, May 18, 2021 at 5:48 AM Daniel Weeks <dwe...@apache.org> wrote:
>> >
>> > Hey Weston,
>> >
>> > From the Iceberg's perspective, the field_id is necessary to track the 
>> > evolution of the schema over time.  It's best to think of the problem from 
>> > a dataset perspective as opposed to a file perspective.
>> >
>> > Iceberg maintains the mapping of the schema with respect to the field ids 
>> > because as the files in the datasets change, the field names may change, 
>> > but field id is intended to be persistent and referenceable regardless of 
>> > name or position within the file.
>> >
>> > For #1 above, I'm not sure I understand the issue of having the field ids 
>> > auto-generated.  If you're not using the field ids to reference the 
>> > columns, does it matter if they are present or not?
>> >
>> > For #2, I would speculate that the field id is less relevant after the 
>> > initial projection and filtering (it really depends on how the engine 
>> > wants to track fields at that point, so I would suspect that maybe field 
>> > id wouldn't be ideal especially after various transforms or aggregations 
>> > are applied).  However, it does matter when persisting the data as the 
>> > field ids need to be resolved to the target dataset.  If it's a new 
>> > dataset, new field ids can be generated using the original approach.  
>> > However, if the data is being appended to an existing dataset, the field 
>> > ids need to be resolved against that target dataset and rewritten before 
>> > persisting to parquet so they align with the Iceberg schema (in SQL this 
>> > is done positionally).
>> >
>> > Let me know if any of that doesn't make sense.  I'm still a little unclear 
>> > on the issue in #1, so it would be helpful if you could clarify that for 
>> > me.
>> >
>> > Thanks,
>> > Dan
>> >
>> > On Mon, May 17, 2021 at 8:50 PM Weston Pace <weston.p...@gmail.com> wrote:
>> >>
>> >> Hello Iceberg devs,
>> >>
>> >> I'm Weston, I've been working on the Arrow project lately and I am
>> >> reviewing how we handle the parquet field_id (and also adding support
>> >> for specifying a field_id at write time) from parquet[1][2].   This
>> >> has brought up two questions.
>> >>
>> >>  1. The original PR adding field_id support[3][4] not only allowed the
>> >> field_id to pass through from parquet to arrow but also generated ids
>> >> (in a depth first fashion) for fields that did not have a field_id.
>> >> In retrospect, it seems this auto-generation of field_id was probably
>> >> not a good idea.  Would it have any impact on Iceberg if we removed
>> >> it?  Just to be clear, we will still have support  for reading (and
>> >> now writing) the parquet field_id.  I am only talking about removing
>> >> the auto-generation of missing values.
>> >>
>> >>  2. For the second question I'm looking for the Iceberg community's
>> >> opinion as users of Arrow.  Arrow is enabling more support for
>> >> computation on data (e.g. relational operators) and I've been
>> >> wondering how those transformations should affect metadata (like the
>> >> field_id).  For some examples:
>> >>
>> >>  * Filtering a table by column (it seems the field_id/metadata should
>> >> remain unchanged)
>> >>  * Filtering a table by rows (it seems the field_id/metadata should
>> >> remain unchanged)
>> >>  * Filling in null values with a placeholder value (the data is changed 
>> >> so ???)
>> >>  * Casting a field to a different data type (the meaning of the data
>> >> has changed so ???)
>> >>  * Combining two fields into a third field (it seems the
>> >> field_id/metadata should be erased in the third field but presumably
>> >> it could also be the joined metadata from the two origin fields)
>> >>
>> >> Thanks for your time,
>> >>
>> >> -Weston Pace
>> >>
>> >> [1] https://issues.apache.org/jira/browse/PARQUET-1798
>> >> [2] https://github.com/apache/arrow/pull/10289
>> >> [3] https://issues.apache.org/jira/browse/ARROW-7080
>> >> [4] https://github.com/apache/arrow/pull/6408
>
>
>
> --
> Ryan Blue

Re: Usage of parquet field_id

Reply via email to