Is there anything that Iceberg needs to do differently here? We've had requests to support reordering fields with `ADD COLUMN ... AFTER other_col` and `UPDATE COLUMN col BEFORE other_col`. Otherwise, do you think we need to change the internal checks?
On Thu, Sep 26, 2019 at 1:23 AM Gautam <gautamkows...@gmail.com> wrote:

> Shone and I synced offline but wanted to circle back here so others can
> hopefully benefit, and so others with more experience with this can
> correct me if there's a better way to achieve this.
>
> *Problem*:
> The use case is that incoming data has fields out of order w.r.t.
> already ingested data in Iceberg. The same scenario applies to nested
> columns as well (e.g. a sub-struct has fields out of order).
> Incoming data might also have added fields. The issue is that if the data
> is ingested as is, Iceberg will complain because of its compatibility
> checks. As it should.
>
> *Solution*:
> Iceberg doesn't depend on field names nor natural order of fields. It
> uses IDs to keep track of schema fields. So if one wants to
> enforce evolution rules correctly, she should first go back to the
> underlying Iceberg schema, apply schema transformation rules using the
> Iceberg Schema Update API, and commit the schema changes to the underlying
> table. Once this is done, Iceberg will have created a new version of the
> schema with new IDs allotted to the added fields. It also accounts for
> different order in the incoming data, as it keeps the id-name mapping for
> all columns.
>
> Here is a gist that captures the scenarios described above with sample
> data: https://gist.github.com/prodeezy/b2cc35b87fca7d43ae681d45b3d7cab3
>
> Cheers,
> -Gautam.
>
> On Wed, Sep 25, 2019 at 5:29 AM Ryan Blue <rb...@netflix.com.invalid>
> wrote:
>
>> Hi Shone,
>>
>> Iceberg should be able to handle out of order data columns in nested
>> structures. We probably just need to relax that compatibility check to
>> allow it. Can you post the error message that you're getting?
>>
>> On Sun, Sep 22, 2019 at 4:49 AM Shone Sadler <ssad...@adobe.com.invalid>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> This question is related to schema evolution support in Iceberg.
>>>
>>> We have data coming in with fields out of order w.r.t. the schema in
>>> Iceberg (e.g. inbound struct(a,b,c) vs. iceberg struct(c,b,a)).
>>>
>>> As a result we are hitting the following error in Iceberg when saving
>>> the data -> "Cannot write incompatible dataset to table with schema",
>>> generated within the IcebergSource ->
>>> https://github.com/apache/incubator-iceberg/blob/d1f0b540f5f14f002be86133ef9f66445f7e0926/spark/src/main/java/org/apache/iceberg/spark/source/IcebergSource.java#L157
>>>
>>> I also noted in the documentation that re-ordering was allowed ->
>>> https://iceberg.apache.org/evolution/ , which led me to believe that we
>>> could update the schema prior to writing the data. However, I see no means
>>> of re-ordering fields on the current UpdateSchema API.
>>>
>>> How are people handling out-of-order fields today?
>>>
>>> Our data is deeply nested, so I am hoping not to have to
>>> transform/prep on ingest and am looking for alternatives.
>>>
>>> Any thoughts appreciated!
>>>
>>> Regards,
>>> Shone Sadler
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>>
>

--
Ryan Blue
Software Engineer
Netflix
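
For readers following the thread, the ID-based approach Gautam describes (evolve the schema first, then match incoming fields to field IDs instead of relying on position) can be sketched in a few lines of plain Python. This is only an illustration of the idea and is not Iceberg code: `Field`, `add_column`, and `align` are hypothetical names made up for this sketch, and the real UpdateSchema API and compatibility checks behave differently in detail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    field_id: int          # stable ID assigned once when the field is added
    name: str              # current name, used only to match incoming data
    children: tuple = ()   # nested struct fields; empty for primitive fields


def add_column(fields, name, last_id):
    """Append a new top-level field under a fresh ID; existing IDs are untouched."""
    new_id = last_id + 1
    return fields + (Field(new_id, name),), new_id


def align(record, fields):
    """Re-key an incoming record by field ID, in table order, matching by name."""
    unknown = sorted(set(record) - {f.name for f in fields})
    if unknown:
        raise ValueError(f"unknown fields {unknown}: evolve the schema first")
    aligned = {}
    for f in fields:
        value = record.get(f.name)  # a field missing from the input becomes None
        if f.children and value is not None:
            value = align(value, f.children)  # same rule for nested structs
        aligned[f.field_id] = value
    return aligned


# Table schema struct(c, b, a); incoming data arrives as struct(a, b, c) plus "d".
table_fields = (Field(1, "c"), Field(2, "b"), Field(3, "a"))
incoming = {"a": 10, "b": 20, "c": 30, "d": 40}

# Evolve the (toy) schema first, then align the out-of-order record to it.
table_fields, last_id = add_column(table_fields, "d", last_id=3)
print(align(incoming, table_fields))  # {1: 30, 2: 20, 3: 10, 4: 40}
```

Calling `align` before `add_column` raises a `ValueError` for the unknown field `d`, which mirrors the point made above that the schema change has to be committed before the new data is written.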