> #1 is a problem and we should remove the auto-generation.

Sounds like we are aligned.
> I hope that helps.

Thanks for the extra details. I've learned a lot, and it helps to know how
this is used.

On Tue, May 18, 2021 at 2:20 PM Ryan Blue <b...@apache.org> wrote:
>
> Hi Weston,
>
> #1 is a problem and we should remove the auto-generation. The issue is
> that auto-generating an ID can result in a collision between Iceberg's
> field IDs and the generated IDs. Since Iceberg uses the ID to identify a
> field, that would result in unrelated data being mistaken for a column's
> data.
>
> Your description above for #2 is a bit confusing for me. Field IDs are
> used to track fields across renames and other schema changes. Those
> schema changes don't happen in a single file. A file is written with some
> schema (which includes IDs), and later field resolution happens based on
> ID. I might have a table with fields `1: a int, 2: b string` that is
> later evolved to `1: x long, 3: b string`. Any given data file is written
> with only one version of the schema. From the IDs, you can see that field
> 1 was renamed and promoted to long, field 2 was deleted, and field 3 was
> added with field 2's original name.
>
> This ID-based approach is an alternative to name-based resolution (like
> Avro uses) or position-based resolution (like CSV uses). Both of those
> resolution methods are flawed and result in correctness issues:
>
> 1. Name-based resolution can't drop a column and add a new one with the
>    same name
> 2. Position-based resolution can't drop a column in the middle of the
>    schema
>
> Only ID-based resolution gives you the expected SQL behavior for table
> evolution (ADD/DROP/RENAME COLUMN).
>
> For your original questions:
>
> * Filtering a table is a matter of selecting columns by ID and running
>   filters by ID. In Iceberg, we bind the current names in a SQL table to
>   the field IDs to do this.
> * Filling in null values is done by identifying that a column ID is
>   missing in a data file. Null values are used in its place.
> * Casting or promoting data is done by strict rules in Iceberg. This is
>   affected by ID because we know that a field is the same across files,
>   like in my example above.
> * For combining fields, it sounds like you're thinking about operations
>   on the data and when to carry IDs through an operation. I wouldn't
>   recommend ever carrying IDs through. In Spark, we use the current
>   schema's names to produce rows. SQL always uses the current names. And
>   when we write back out to a table, we use SQL semantics, which are to
>   align by position.
>
> I hope that helps. If it's not clear, I'm happy to jump on a call to talk
> through it with you.
>
> Ryan
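As an illustration of the ID-based resolution Ryan describes above, here is a
minimal Python sketch using his example schemas. The dict-based schema
representation and the print statements are illustrative only, not Iceberg's
actual implementation:

```python
# A file written under the old schema; the table has since evolved.
# Columns are matched by field id, never by name or position.
file_schema = {1: ("a", "int"), 2: ("b", "string")}
table_schema = {1: ("x", "long"), 3: ("b", "string")}

for field_id, (name, typ) in table_schema.items():
    if field_id in file_schema:
        src_name, src_type = file_schema[field_id]
        # Same id in both schemas: read that column, applying any
        # rename (a -> x) and type promotion (int -> long).
        print(f"read file column {src_name!r} ({src_type}) as {name!r} ({typ})")
    else:
        # Id missing from the file: the column was added after this
        # file was written, so fill it with nulls.
        print(f"fill column {name!r} with nulls")

# Field id 2 appears only in the file: that column was dropped from the
# table, so its data is ignored, even though the file's "b" shares a
# name with the table's new "b" (id 3).
```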
> On Tue, May 18, 2021 at 1:48 PM Weston Pace <weston.p...@gmail.com> wrote:
>>
>> Ok, this is matching my understanding of how field_id is used as well.
>> I believe #1 will not be an issue, because I think Iceberg always sets
>> the field_id property when writing data. If that is the case, then
>> Iceberg would never have noticed the old behavior. In other words,
>> Iceberg never relied on Arrow to set the field_id.
>>
>> For #2, I think your example is helpful. The `field_id` is sort of a
>> file-specific concept. Once you are at the dataset layer, the Iceberg
>> schema takes precedence and the field_id is no longer necessary.
>>
>> Also, thinking about it more generally, metadata is really part of the
>> schema / control channel. The compute operations in Arrow are more
>> involved with the data channel. "Combining metadata" might be a concern
>> of tools that "combine schema" (e.g. dataset evolution) but isn't a
>> concern of tools that combine data (e.g. Arrow compute). So in that
>> sense, the compute operations probably don't need to worry much about
>> preserving schema metadata.
>>
>> This has been helpful to hear how this is used. I needed a concrete
>> example to bounce the idea around in my head.
>>
>> Thanks,
>>
>> -Weston
>>
>> On Tue, May 18, 2021 at 5:48 AM Daniel Weeks <dwe...@apache.org> wrote:
>> >
>> > Hey Weston,
>> >
>> > From Iceberg's perspective, the field_id is necessary to track the
>> > evolution of the schema over time. It's best to think of the problem
>> > from a dataset perspective as opposed to a file perspective.
>> >
>> > Iceberg maintains the mapping of the schema with respect to the field
>> > ids because, as the files in the dataset change, the field names may
>> > change, but the field id is intended to be persistent and
>> > referenceable regardless of name or position within the file.
>> >
>> > For #1 above, I'm not sure I understand the issue of having the field
>> > ids auto-generated. If you're not using the field ids to reference
>> > the columns, does it matter if they are present or not?
>> >
>> > For #2, I would speculate that the field id is less relevant after
>> > the initial projection and filtering (it really depends on how the
>> > engine wants to track fields at that point, so I would suspect that
>> > maybe field id wouldn't be ideal, especially after various transforms
>> > or aggregations are applied). However, it does matter when persisting
>> > the data, as the field ids need to be resolved to the target dataset.
>> > If it's a new dataset, new field ids can be generated using the
>> > original approach. However, if the data is being appended to an
>> > existing dataset, the field ids need to be resolved against that
>> > target dataset and rewritten before persisting to parquet so they
>> > align with the Iceberg schema (in SQL this is done positionally).
>> >
>> > Let me know if any of that doesn't make sense. I'm still a little
>> > unclear on the issue in #1, so it would be helpful if you could
>> > clarify that for me.
>> >
>> > Thanks,
>> > Dan
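Dan's point about appending to an existing dataset, resolving field ids
positionally and rewriting them before persisting, could look roughly like
the pyarrow sketch below. The `align_field_ids` helper and the positional id
list are hypothetical, not an Iceberg or Arrow API:

```python
import pyarrow as pa

def align_field_ids(table, target_ids):
    # Hypothetical helper: assign the target dataset's field ids to an
    # incoming table by position (the SQL-style alignment mentioned
    # above), storing them under the "PARQUET:field_id" metadata key
    # that the parquet writer picks up.
    fields = [
        field.with_metadata({b"PARQUET:field_id": str(fid).encode()})
        for field, fid in zip(table.schema, target_ids)
    ]
    return pa.Table.from_arrays(table.columns, schema=pa.schema(fields))

# Appending to a dataset whose current schema is `1: x long, 3: b string`:
incoming = pa.table({"x": [7], "b": ["y"]})
aligned = align_field_ids(incoming, [1, 3])
```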
>> > On Mon, May 17, 2021 at 8:50 PM Weston Pace <weston.p...@gmail.com> wrote:
>> >>
>> >> Hello Iceberg devs,
>> >>
>> >> I'm Weston. I've been working on the Arrow project lately, and I am
>> >> reviewing how we handle the parquet field_id (and also adding
>> >> support for specifying a field_id at write time) [1][2]. This has
>> >> brought up two questions.
>> >>
>> >> 1. The original PR adding field_id support [3][4] not only allowed
>> >> the field_id to pass through from parquet to arrow but also
>> >> generated ids (in a depth-first fashion) for fields that did not
>> >> have a field_id. In retrospect, it seems this auto-generation of
>> >> field_id was probably not a good idea. Would it have any impact on
>> >> Iceberg if we removed it? Just to be clear, we will still have
>> >> support for reading (and now writing) the parquet field_id. I am
>> >> only talking about removing the auto-generation of missing ids.
>> >>
>> >> 2. For the second question, I'm looking for the Iceberg community's
>> >> opinion as users of Arrow. Arrow is enabling more support for
>> >> computation on data (e.g. relational operators), and I've been
>> >> wondering how those transformations should affect metadata (like
>> >> the field_id). For some examples:
>> >>
>> >> * Filtering a table by column (it seems the field_id/metadata
>> >> should remain unchanged)
>> >> * Filtering a table by rows (it seems the field_id/metadata should
>> >> remain unchanged)
>> >> * Filling in null values with a placeholder value (the data is
>> >> changed, so ???)
>> >> * Casting a field to a different data type (the meaning of the data
>> >> has changed, so ???)
>> >> * Combining two fields into a third field (it seems the
>> >> field_id/metadata should be erased in the third field, but
>> >> presumably it could also be the joined metadata from the two origin
>> >> fields)
>> >>
>> >> Thanks for your time,
>> >>
>> >> -Weston Pace
>> >>
>> >> [1] https://issues.apache.org/jira/browse/PARQUET-1798
>> >> [2] https://github.com/apache/arrow/pull/10289
>> >> [3] https://issues.apache.org/jira/browse/ARROW-7080
>> >> [4] https://github.com/apache/arrow/pull/6408
>
>
> --
> Ryan Blue
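For context on the mechanism discussed in [1]-[4]: in pyarrow, the parquet
field_id surfaces through the `PARQUET:field_id` field-metadata key, so an
explicit (non-auto-generated) round trip can be sketched roughly as below.
This reflects the behavior added by those changes in recent pyarrow releases;
treat the details as an assumption rather than a stable contract:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Attach explicit parquet field ids via the "PARQUET:field_id"
# metadata key instead of relying on any auto-generation.
schema = pa.schema([
    pa.field("a", pa.int32(), metadata={b"PARQUET:field_id": b"1"}),
    pa.field("b", pa.string(), metadata={b"PARQUET:field_id": b"2"}),
])
table = pa.table({"a": [1, 2], "b": ["x", "y"]}, schema=schema)
pq.write_table(table, "data.parquet")

# Reading back exposes the same ids in the Arrow field metadata.
for field in pq.read_table("data.parquet").schema:
    print(field.name, field.metadata[b"PARQUET:field_id"])
```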