Thanks for the clarification. Is https://iceberg.apache.org/spec/#iceberg-table-spec the official doc for the V2 spec? The Google doc at https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit is a breakdown of tasks, not the spec itself.

On Tue, Mar 16, 2021 at 10:31 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:

> Yan is absolutely correct.
>
> We only leverage the sort order during DELETE/UPDATE/MERGE operations in
> Spark for now as we handle the plan construction ourselves. There will be
> an API in Spark 3.2 to request a specific distribution and ordering for
> normal writes. There are also similar efforts in Flink.
>
> - Anton
>
> On 16 Mar 2021, at 17:03, Yan Yan <yyany...@gmail.com> wrote:
>
> Hi Chen,
>
> I think currently the sort order support is mostly only on the Iceberg
> spec level. The user can specify a sort order on the table, and ideally the
> writer should use this information on the table to determine the right sort
> order it should use for writing data, and persist this information to data
> files. But at this moment we don't have integration between engine and
> Iceberg library to allow writers to write anything other than 0 (unsorted,
> which is default) for any data files; and even if it's possible, I think we
> are still lacking engines' support for sort order in general; I think there
> are active efforts on Spark to support sort order in writing but I'm not
> sure about the other engines. And yes, it should be the responsibility of
> the writer to ensure the data is indeed sorted before writing the sort order
> information to files. And for your second question, I think we don't have
> this support for now, which is mostly due to the feature still being under
> development for the same reason mentioned above.
>
> Thank you,
> Yan
>
>
> On Tue, Mar 16, 2021 at 2:33 PM Chen Song <chen.song...@gmail.com> wrote:
>
>> Thanks Yan. I have a question about sort order support. I saw
>> https://iceberg.apache.org/spec/#sorting talking about support on
>> sorting. And I found related tickets like #1373
>> <https://github.com/apache/iceberg/pull/1373> and #1975
>> <https://github.com/apache/iceberg/pull/1975>. However, it is not clear
>> to me how this is enforced end to end.
>>
>>    - Currently, it seems that the sort order info can be persisted in
>>    manifests. On data files, how is this enforced? Is it the writer's
>>    responsibility to ensure the data is sorted before commit, based on the
>>    sort order info defined at the table level?
>>    - Assuming data is sorted within each data file, is the Iceberg core
>>    reader able to return all data (possibly across partitions) in total
>>    sorted order when reading, based on the sort order information stored
>>    in manifests?
>>
>> Essentially, if we want to support sorting on the underlying data when
>> reading through the core data API, what are the right and required things
>> to do?
>>
>> Thanks,
>> Chen
>>
>>
>> On Tue, Mar 16, 2021 at 4:05 PM Yan Yan <yyany...@gmail.com> wrote:
>>
>>> Hi Chen,
>>>
>>> Here is the doc on remaining tasks for format V2 that I updated with the
>>> latest status today, including individual PRs pending review and tasks
>>> needed that are V2-blocking:
>>> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
>>> Please feel free to comment/edit as needed.
>>>
>>> As mentioned in Anton's email, it would be great if more people can
>>> review the pending PRs.
>>>
>>> Thank you!
>>> Yan
>>>
>>>
>>> On Tue, Mar 16, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the summary. On the V2 format: is there a Google doc to
>>>> review, or any sort of backlog of tickets to track?
>>>>
>>>> Chen
>>>>
>>>> On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi <aokolnyc...@apple.com.invalid> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> Thanks to folks who attended. I added my notes from the last sync.
>>>>> Please feel free to add/correct if I missed anything.
>>>>>
>>>>> Main points
>>>>>
>>>>>    - Highlights
>>>>>       - StreamingOffset for Structured Streaming in Spark
>>>>>       - New Actions API
>>>>>       - Spark procedure for partial import of existing tables
>>>>>       - Subsurface talks are online
>>>>>       - Call for papers is open at ApacheCon and Subsurface
>>>>>    - Releases
>>>>>       - 0.11.1
>>>>>          - Waiting for the fix on handling situations when the
>>>>>          metastore fails during commit (#2317).
>>>>>       - 0.12.0
>>>>>          - Should include Spark 3.1 support
>>>>>          - V2 format items should be included whenever possible but
>>>>>          should not block the release
>>>>>          - No new blockers
>>>>>          - Ideally, end of March
>>>>>    - Table corruption issue (#2317
>>>>>    <https://github.com/apache/iceberg/issues/2317>)
>>>>>       - We may corrupt tables if the metastore fails during commit
>>>>>       and the commit state is unknown. Iceberg may delete files that
>>>>>       were actually committed.
>>>>>       - A lot of folks have seen this issue.
>>>>>       - Parth has shared some thoughts from a discussion they had
>>>>>       internally here
>>>>>       <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>>>>>       - We can handle this issue in two phases:
>>>>>          - Don’t corrupt the table (Russell has a PR)
>>>>>          - Avoid duplicated results if operations are blindly retried
>>>>>          (can be done in a follow-up PR)
>>>>>       - Seems worth including the first part in 0.11.1
>>>>>    - V2 format
>>>>>       - Open points:
>>>>>          - Primary key or row id for upserts
>>>>>          - Propagating the sort order id for files on write
>>>>>       - Need more reviewers
>>>>>    - Encryption
>>>>>       - Multiple people expressed interest in data encryption.
>>>>>       - Existing work by John here
>>>>>       <https://github.com/apache/iceberg/pull/1918>.
>>>>>       - Ideally, should leverage as much as possible of modular
>>>>>       encryption in Parquet 1.12 discussed here
>>>>>       <https://github.com/apache/iceberg/issues/1413>.
>>>>>       - Agreed to start a thread on the dev list.
>>>>>    - CachingCatalog issues (#2319
>>>>>    <https://github.com/apache/iceberg/issues/2319>)
>>>>>       - The current behavior leads to stale data if multiple sessions
>>>>>       are used.
>>>>>       - No ideal solution due to Spark limitations. Agreed to discuss
>>>>>       in the issue.
>>>>>    - Multi-table transactions
>>>>>       - Jacques has proposed an API here
>>>>>       <https://github.com/apache/iceberg/pull/1849> and is about to
>>>>>       start working on an implementation.
>>>>>       - Agreed to collaborate on the dev list. More eyes would be
>>>>>       great.
>>>>>
>>>>>
>>>>> The link to the doc:
>>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
>>>>>
>>>>> Thanks,
>>>>> Anton
>>>>>
>>>>
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>
>> --
>> Chen Song
>>
>>
>
--
Chen Song
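
For anyone reading this thread later, the writer-side contract described in Yan's reply can be sketched in a few lines of Python. This is not Iceberg code: `SortOrder`, `is_sorted`, `sort_order_id_for_file`, and the rows-as-dicts model are hypothetical names made up for illustration. The only details taken from the thread are that sort order id 0 means unsorted (the default) and that the writer is responsible for making sure the data really is sorted before recording a sort order id on a data file.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

# Per the thread: 0 means "unsorted" and is the default for data files.
UNSORTED_ORDER_ID = 0


@dataclass(frozen=True)
class SortOrder:
    """Hypothetical stand-in for a table-level sort order (not the Iceberg API)."""
    order_id: int
    key: Callable[[dict], Any]  # extracts the sort key from one row


def is_sorted(rows: Sequence[dict], key: Callable[[dict], Any]) -> bool:
    """Return True if rows are in non-decreasing order of key."""
    return all(key(a) <= key(b) for a, b in zip(rows, rows[1:]))


def sort_order_id_for_file(rows: Sequence[dict], table_order: SortOrder) -> int:
    """Pick the sort order id a writer may record for one data file.

    The table's order id is recorded only when the rows were verified to be
    sorted; otherwise the file is marked as unsorted (0).
    """
    if table_order.order_id != UNSORTED_ORDER_ID and is_sorted(rows, table_order.key):
        return table_order.order_id
    return UNSORTED_ORDER_ID
```

A writer following this sketch would call `sort_order_id_for_file(rows, order)` once per file just before commit, so a file is never labelled with an order it does not actually satisfy. A real engine integration would more likely guarantee the ordering in the write plan than re-check every file.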