https://iceberg.apache.org/spec/#iceberg-table-spec is the official doc for the Iceberg spec. It covers many aspects of the V2 spec but is not yet comprehensive, since both the development and the documentation of V2 are still in progress.
Yan

On Mon, Mar 22, 2021 at 8:47 AM Chen Song <chen.song...@gmail.com> wrote:

> Thanks for the clarification. Is
> https://iceberg.apache.org/spec/#iceberg-table-spec the official doc for
> V2 spec? The
> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/
> <https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit>
> is a breakdown of tasks but not the spec itself.
>
>
> On Tue, Mar 16, 2021 at 10:31 PM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
>
>> Yan is absolutely correct.
>>
>> We only leverage the sort order during DELETE/UPDATE/MERGE operations in
>> Spark for now as we handle the plan construction ourselves. There will be
>> an API in Spark 3.2 to request a specific distribution and ordering for
>> normal writes. There are also similar efforts in Flink.
>>
>> - Anton
>>
>> On 16 Mar 2021, at 17:03, Yan Yan <yyany...@gmail.com> wrote:
>>
>> Hi Chen,
>>
>> I think currently the sort order support is mostly only on the Iceberg
>> spec level. The user can specify sort order on table, and ideally writer
>> should use this information on the table to determine the right sort order
>> it should use for writing data, and persist this information to data files.
>> But at this moment we don't have integration between engine and Iceberg
>> library to allow writers to write anything other than 0 (unsorted, which is
>> default) for any data files; and even if it's possible, I think we are still
>> lacking engines' support for sort order in general; I think there are
>> active efforts on Spark to support sort order in writing but I'm not sure
>> about the other engines. And yes, it should be the responsibility of the
>> writer to ensure the data is indeed sorted before writing the sort order
>> information to files. And for your second question, I think we don't have
>> this support for now, which is mostly due to the feature still being under
>> development for the same reason mentioned above.
>>
>> Thank you,
>> Yan
>>
>>
>> On Tue, Mar 16, 2021 at 2:33 PM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> Thanks Yan. I have a question about sort order support. I saw
>>> https://iceberg.apache.org/spec/#sorting talking about support on
>>> sorting. And I found related tickets like #1373
>>> <https://github.com/apache/iceberg/pull/1373> and #1975
>>> <https://github.com/apache/iceberg/pull/1975>. However, it is not clear
>>> to me how this is enforced end to end.
>>>
>>>    - Currently, it seems that the sort order info can be persisted in
>>>    manifests. On data files, how is this enforced? Is it the writer's
>>>    responsibility to ensure the data is sorted before commit, based on
>>>    the sort order info defined on table level?
>>>    - Assuming data is sorted within each data file, is the Iceberg core
>>>    reader able to return all data (across partitions possibly) in total
>>>    sorted order when reading, based on the sort order information stored
>>>    in manifests?
>>>
>>> Essentially, if we want to support sorting on the underlying data when
>>> reading with the core data API, what are the right and required things
>>> to do?
>>>
>>> Thanks,
>>> Chen
>>>
>>>
>>> On Tue, Mar 16, 2021 at 4:05 PM Yan Yan <yyany...@gmail.com> wrote:
>>>
>>>> Hi Chen,
>>>>
>>>> Here is the doc on remaining tasks for format V2 that I updated with
>>>> the latest status today, including individual PRs pending review and tasks
>>>> needed that are V2-blocking:
>>>> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
>>>> Please feel free to comment/edit as needed.
>>>>
>>>> As mentioned in Anton's email, it would be great if more people can
>>>> review the pending PRs.
>>>>
>>>> Thank you!
>>>> Yan
>>>>
>>>>
>>>> On Tue, Mar 16, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the summary. On V2 format: is there a Google doc to review,
>>>>> or any sort of backlog of tickets to track?
>>>>>
>>>>> Chen
>>>>>
>>>>> On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi
>>>>> <aokolnyc...@apple.com.invalid> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> Thanks to folks who attended. I added my notes from the last sync.
>>>>>> Please feel free to add/correct if I missed anything.
>>>>>>
>>>>>> Main points
>>>>>>
>>>>>>    - Highlights
>>>>>>       - StreamingOffset for Structured Streaming in Spark
>>>>>>       - New Actions API
>>>>>>       - Spark procedure for partial import of existing tables
>>>>>>       - Subsurface talks are online
>>>>>>       - Call for papers is open at ApacheCon and Subsurface
>>>>>>    - Releases
>>>>>>       - 0.11.1
>>>>>>          - Waiting for the fix on handling situations when the
>>>>>>          metastore fails during commit (#2317).
>>>>>>       - 0.12.0
>>>>>>          - Should include Spark 3.1 support
>>>>>>          - V2 format items should be included whenever possible but
>>>>>>          should not block the release
>>>>>>          - No new blockers
>>>>>>          - Ideally, end of March
>>>>>>    - Table corruption issue (#2317
>>>>>>    <https://github.com/apache/iceberg/issues/2317>)
>>>>>>       - We may corrupt tables if the metastore fails during commit
>>>>>>       and the commit state is unknown. Iceberg may delete files that were
>>>>>>       actually committed.
>>>>>>       - A lot of folks have seen this issue.
>>>>>>       - Parth has shared some thoughts from a discussion they had
>>>>>>       internally here
>>>>>>       <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>>>>>>       - We can handle this issue in two phases:
>>>>>>          - Don’t corrupt the table (Russell has a PR)
>>>>>>          - Avoid duplicated results if operations are blindly
>>>>>>          retried (can be done in a follow-up PR)
>>>>>>       - Seems worth including the first part in 0.11.1
>>>>>>    - V2 format
>>>>>>       - Open points:
>>>>>>          - Primary key or row id for upserts
>>>>>>          - Propagating the sort order id for files on write
>>>>>>       - Need more reviewers
>>>>>>    - Encryption
>>>>>>       - Multiple people expressed interest in data encryption.
>>>>>>       - Existing work by John here
>>>>>>       <https://github.com/apache/iceberg/pull/1918>.
>>>>>>       - Ideally, should leverage as much as possible of modular
>>>>>>       encryption in Parquet 1.12 discussed here
>>>>>>       <https://github.com/apache/iceberg/issues/1413>.
>>>>>>       - Agreed to start a thread on the dev list.
>>>>>>    - CachingCatalog issues (#2319
>>>>>>    <https://github.com/apache/iceberg/issues/2319>)
>>>>>>       - The current behavior leads to stale data if multiple
>>>>>>       sessions are used.
>>>>>>       - No ideal solution due to Spark limitations. Agreed to
>>>>>>       discuss in the issue.
>>>>>>    - Multi-table transactions
>>>>>>       - Jacques has proposed an API here
>>>>>>       <https://github.com/apache/iceberg/pull/1849> and is about to
>>>>>>       start working on an implementation.
>>>>>>       - Agreed to collaborate on the dev list. More eyes would be
>>>>>>       great.
>>>>>>
>>>>>>
>>>>>> The link to the doc:
>>>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chen Song
>>>>>
>>>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>
> --
> Chen Song
>
>
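
To make the sort order point from the thread concrete (a writer should record the table's sort order id on a data file only after making sure the rows really are sorted, and otherwise fall back to 0, the unsorted default), here is a minimal Python sketch. It does not use the Iceberg library or any engine API; `DataFileMeta`, `is_sorted`, and `describe_file` are hypothetical names made up for this illustration, and the rows are plain in-memory dicts.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Per the thread: 0 means "unsorted", which is the default for data files.
UNSORTED_ORDER_ID = 0


@dataclass(frozen=True)
class DataFileMeta:
    """Hypothetical stand-in for the per-file metadata a writer would report."""
    path: str
    record_count: int
    sort_order_id: int


def is_sorted(rows: Sequence[Any], key: Callable[[Any], Any]) -> bool:
    """Return True if rows are in non-decreasing order of key(row)."""
    keys = [key(row) for row in rows]
    return all(a <= b for a, b in zip(keys, keys[1:]))


def describe_file(
    path: str,
    rows: Sequence[Any],
    table_sort_order_id: int,
    key: Callable[[Any], Any],
) -> DataFileMeta:
    """Tag the file with the table's sort order id only if the rows are sorted.

    If the check fails, fall back to the unsorted id instead of claiming an
    order that the data does not actually have.
    """
    order_id = table_sort_order_id if is_sorted(rows, key) else UNSORTED_ORDER_ID
    return DataFileMeta(path=path, record_count=len(rows), sort_order_id=order_id)
```

For example, calling `describe_file("a.parquet", rows, 1, lambda r: r["id"])` would report sort order id 1 only when `rows` is already ordered by `id`; any out-of-order pair makes it report 0. A real writer would more likely sort the data itself (or ask the engine to) before writing, rather than only checking afterwards.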