Re: Iceberg sync notes - 10 March 2021

Yan Yan Mon, 22 Mar 2021 11:33:09 -0700

https://iceberg.apache.org/spec/#iceberg-table-spec is the official doc for
the Iceberg spec. It covers many aspects of the V2 spec but is not yet
comprehensive, as both the development and the documentation of V2 are
still in progress.


Yan

On Mon, Mar 22, 2021 at 8:47 AM Chen Song <chen.song...@gmail.com> wrote:

> Thanks for the clarification. Is
> https://iceberg.apache.org/spec/#iceberg-table-spec the official doc for
> V2 spec? The
> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/
> <https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit>
> is a breakdown of tasks but not the spec itself.
>
>
> On Tue, Mar 16, 2021 at 10:31 PM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
>
>> Yan is absolutely correct.
>>
>> We only leverage the sort order during DELETE/UPDATE/MERGE operations in
>> Spark for now as we handle the plan construction ourselves. There will be
>> an API in Spark 3.2 to request a specific distribution and ordering for
>> normal writes. There are also similar efforts in Flink.
>>
>> - Anton
>>
>> On 16 Mar 2021, at 17:03, Yan Yan <yyany...@gmail.com> wrote:
>>
>> Hi Chen,
>>
>> I think sort order support currently exists mostly at the Iceberg spec
>> level. The user can specify a sort order on the table, and ideally the
>> writer should use that information to determine the right sort order for
>> the data it writes and persist it to the data files. At the moment,
>> though, there is no integration between the engines and the Iceberg
>> library that lets writers write anything other than 0 (unsorted, which is
>> the default) for any data file. Even if that were possible, engine support
>> for sort order is still lacking in general; there are active efforts in
>> Spark to support sort order on write, but I'm not sure about the other
>> engines.
>>
>> And yes, it is the writer's responsibility to ensure the data is actually
>> sorted before writing the sort order information to files. As for your
>> second question, I don't think we have this support for now, mostly
>> because the feature is still under development, for the same reasons
>> mentioned above.
>>
>> Thank you,
>> Yan
>>
>>
>> On Tue, Mar 16, 2021 at 2:33 PM Chen Song <chen.song...@gmail.com> wrote:
>>
>>> Thanks Yan. I have a question about sort order support. I saw
>>> https://iceberg.apache.org/spec/#sorting talking about support on
>>> sorting. And I found related tickets like #1373
>>> <https://github.com/apache/iceberg/pull/1373> and #1975
>>> <https://github.com/apache/iceberg/pull/1975>. However, it is not clear
>>> to me how this is enforced end to end.
>>>
>>>    - Currently, it seems that the sort order info can be persisted in
>>>    manifests. On data files, how is this enforced? Is it the writer's
>>>    responsibility to ensure the data is sorted before commit, based on
>>>    the sort order info defined at the table level?
>>>    - Assuming data is sorted within each data file, is the Iceberg core
>>>    reader able to return all data (possibly across partitions) in total
>>>    sorted order when reading, based on the sort order information stored
>>>    in manifests?
>>>
>>> Essentially, if we want to support sorting on the underlying data when
>>> reading with the core data API, what are the right and required things
>>> to do?
>>>
>>> Thanks,
>>> Chen
>>>
>>>
>>> On Tue, Mar 16, 2021 at 4:05 PM Yan Yan <yyany...@gmail.com> wrote:
>>>
>>>> Hi Chen,
>>>>
>>>> Here is the doc on remaining tasks for format V2 that I updated with
>>>> the latest status today, including individual PRs pending review and tasks
>>>> needed that are V2-blocking:
>>>> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
>>>> Please feel free to comment/edit as needed.
>>>>
>>>> As mentioned in Anton's email, it would be great if more people can
>>>> review the pending PRs.
>>>>
>>>> Thank you!
>>>> Yan
>>>>
>>>>
>>>> On Tue, Mar 16, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thanks for the summary. On the V2 format: is there a Google doc to
>>>>> review, or any sort of backlog of tickets to track?
>>>>>
>>>>> Chen
>>>>>
>>>>> On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi <
>>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>>
>>>>>> Hey everyone,
>>>>>>
>>>>>> Thanks to folks who attended. I added my notes from the last sync.
>>>>>> Please feel free to add/correct if I missed anything.
>>>>>>
>>>>>> Main points
>>>>>>
>>>>>>    - Highlights
>>>>>>       - StreamingOffset for Structured Streaming in Spark
>>>>>>       - New Actions API
>>>>>>       - Spark procedure for partial import of existing tables
>>>>>>       - Subsurface talks are online
>>>>>>       - Call for papers is open at ApacheCon and Subsurface
>>>>>>    - Releases
>>>>>>       - 0.11.1
>>>>>>          - Waiting for the fix on handling situations when the
>>>>>>          metastore fails during commit (#2317).
>>>>>>       - 0.12.0
>>>>>>          - Should include Spark 3.1 support
>>>>>>          - V2 format items should be included whenever possible but
>>>>>>          should not block the release
>>>>>>          - No new blockers
>>>>>>          - Ideally, end of March
>>>>>>    - Table corruption issue (#2317
>>>>>>    <https://github.com/apache/iceberg/issues/2317>)
>>>>>>       - We may corrupt tables if the metastore fails during commit
>>>>>>       and the commit state is unknown. Iceberg may delete files that were
>>>>>>       actually committed.
>>>>>>       - A lot of folks have seen this issue.
>>>>>>       - Parth has shared some thoughts from a discussion they had
>>>>>>       internally here
>>>>>>       <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>>>>>>       - We can handle this issue in two phases:
>>>>>>          - Don’t corrupt the table (Russell has a PR)
>>>>>>          - Avoid duplicated results if operations are blindly
>>>>>>          retried (can be done in a follow-up PR)
>>>>>>       - Seems worth including the first part in 0.11.1
>>>>>>    - V2 format
>>>>>>       - Open points:
>>>>>>          - Primary key or row id for upserts
>>>>>>          - Propagating the sort order id for files on write
>>>>>>       - Need more reviewers
>>>>>>    - Encryption
>>>>>>       - Multiple people expressed interest in data encryption.
>>>>>>       - Existing work by John here
>>>>>>       <https://github.com/apache/iceberg/pull/1918>.
>>>>>>       - Ideally, should leverage as much as possible of modular
>>>>>>       encryption in Parquet 1.12 discussed here
>>>>>>       <https://github.com/apache/iceberg/issues/1413>.
>>>>>>       - Agreed to start a thread on the dev list.
>>>>>>    - CachingCatalog issues (#2319
>>>>>>    <https://github.com/apache/iceberg/issues/2319>)
>>>>>>       - The current behavior leads to stale data if multiple
>>>>>>       sessions are used.
>>>>>>       - No ideal solution due to Spark limitations. Agreed to
>>>>>>       discuss in the issue.
>>>>>>    - Multi-table transactions
>>>>>>       - Jacques has proposed an API here
>>>>>>       <https://github.com/apache/iceberg/pull/1849> and is about to
>>>>>>       start working on an implementation.
>>>>>>       - Agreed to collaborate on the dev list. More eyes would be
>>>>>>       great.
>>>>>>
>>>>>>
>>>>>> The link to the doc:
>>>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
>>>>>>
>>>>>> Thanks,
>>>>>> Anton
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Chen Song
>>>>>
>>>>>
>>>
>>> --
>>> Chen Song
>>>
>>>
>>
>
> --
> Chen Song
>
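
On the sort order question quoted above: the thread says it is the writer's responsibility to make sure the data is really sorted before recording a sort order on a data file, and that 0 means unsorted. A minimal sketch of that rule follows. It is not Iceberg code; `SortField`, `compare_rows`, `is_sorted` and `sort_order_id_for_file` are hypothetical names, rows are plain dicts, and null handling is left out for brevity.

```python
from dataclasses import dataclass
from typing import Any, Dict, Sequence

UNSORTED_ORDER_ID = 0  # per the thread: 0 means unsorted and is the default


@dataclass(frozen=True)
class SortField:
    """Hypothetical stand-in for one field of a table-level sort order."""
    name: str
    ascending: bool = True


def compare_rows(a: Dict[str, Any], b: Dict[str, Any],
                 fields: Sequence[SortField]) -> int:
    """Return -1, 0 or 1 depending on how row a orders relative to row b."""
    for field in fields:
        x, y = a[field.name], b[field.name]
        if x == y:
            continue  # tie on this field, look at the next one
        a_first = x < y
        if not field.ascending:
            a_first = not a_first
        return -1 if a_first else 1
    return 0


def is_sorted(rows: Sequence[Dict[str, Any]],
              fields: Sequence[SortField]) -> bool:
    """True if every adjacent pair of rows respects the sort fields."""
    return all(compare_rows(rows[i], rows[i + 1], fields) <= 0
               for i in range(len(rows) - 1))


def sort_order_id_for_file(rows: Sequence[Dict[str, Any]],
                           table_order_id: int,
                           fields: Sequence[SortField]) -> int:
    """Record the table's sort order id only if the rows are verified sorted."""
    if table_order_id != UNSORTED_ORDER_ID and is_sorted(rows, fields):
        return table_order_id
    return UNSORTED_ORDER_ID  # fall back to "unsorted" rather than claim an order
```

A writer following this rule would call `sort_order_id_for_file` once per data file, just before adding the file to a manifest, and fall back to 0 whenever the check fails.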
>
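
On the table corruption item (#2317) in the notes quoted above: the first phase, "don't corrupt the table", comes down to not deleting newly written files when the outcome of a commit is unknown. The sketch below illustrates that idea only. It is not the fix from the PR mentioned in the notes; the exception classes and `commit_with_safe_cleanup` are hypothetical names, and the commit and delete operations are passed in as plain callables.

```python
from typing import Callable, Iterable


class CommitFailed(Exception):
    """Hypothetical: the metastore definitely rejected the commit."""


class CommitStateUnknown(Exception):
    """Hypothetical: the metastore failed and the commit outcome is unknown."""


def commit_with_safe_cleanup(commit: Callable[[], None],
                             new_files: Iterable[str],
                             delete_file: Callable[[str], None]) -> None:
    """Run commit(); clean up new files only when the commit definitely failed."""
    try:
        commit()
    except CommitFailed:
        # The commit did not happen, so the new files are orphans and safe to remove.
        for path in new_files:
            delete_file(path)
        raise
    except CommitStateUnknown:
        # The commit may have succeeded. Deleting the files here could remove
        # data that the table now references, so leave them in place.
        raise
```

The cost of this choice is possible orphan files after an unknown outcome, which is a cleanup problem rather than a correctness problem. Avoiding duplicated results on blind retries, the second phase in the notes, is not covered here.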
