Re: Iceberg sync notes - 10 March 2021

Yan Yan Tue, 16 Mar 2021 13:05:32 -0700

Hi Chen,

Here is the doc on remaining tasks for format V2 that I updated with the
latest status today, including individual PRs pending review and tasks
needed that are V2-blocking:
https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
Please feel free to comment/edit as needed.


As mentioned in Anton's email, it would be great if more people can review
the pending PRs.

Thank you!
Yan


On Tue, Mar 16, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com> wrote:

> Thanks for the summary. On the V2 format: is there a Google doc to review, or
> any sort of backlog of tickets to track?
>
> Chen
>
> On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi
> <aokolnyc...@apple.com.invalid> wrote:
>
>> Hey everyone,
>>
>> Thanks to folks who attended. I added my notes from the last sync. Please
>> feel free to add/correct if I missed anything.
>>
>> Main points
>>
>>    - Highlights
>>       - StreamingOffset for Structured Streaming in Spark
>>       - New Actions API
>>       - Spark procedure for partial import of existing tables
>>       - Subsurface talks are online
>>       - Call for papers is open at ApacheCon and Subsurface
>>    - Releases
>>       - 0.11.1
>>          - Waiting for the fix on handling situations when the metastore
>>          fails during commit (#2317).
>>       - 0.12.0
>>          - Should include Spark 3.1 support
>>          - V2 format items should be included whenever possible but
>>          should not block the release
>>          - No new blockers
>>          - Ideally, end of March
>>    - Table corruption issue (#2317
>>    <https://github.com/apache/iceberg/issues/2317>)
>>       - We may corrupt tables if the metastore fails during commit and
>>       the commit state is unknown. Iceberg may delete files that were
>>       actually committed.
>>       - A lot of folks have seen this issue.
>>       - Parth has shared some thoughts from a discussion they had
>>       internally here
>>       <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>>       - We can handle this issue in two phases:
>>          - Don’t corrupt the table (Russell has a PR)
>>          - Avoid duplicated results if operations are blindly retried
>>          (can be done in a follow-up PR)
>>       - Seems worth including the first part in 0.11.1
>>    - V2 format
>>       - Open points:
>>          - Primary key or row id for upserts
>>          - Propagating the sort order id for files on write
>>       - Need more reviewers
>>    - Encryption
>>       - Multiple people expressed interest in data encryption.
>>       - Existing work by John here
>>       <https://github.com/apache/iceberg/pull/1918>.
>>       - Ideally, should leverage as much as possible of modular
>>       encryption in Parquet 1.12 discussed here
>>       <https://github.com/apache/iceberg/issues/1413>.
>>       - Agreed to start a thread on the dev list.
>>    - CachingCatalog issues (#2319
>>    <https://github.com/apache/iceberg/issues/2319>)
>>       - The current behavior leads to stale data if multiple sessions
>>       are used.
>>       - No ideal solution due to Spark limitations. Agreed to discuss in
>>       the issue.
>>    - Multi-table transactions
>>       - Jacques has proposed an API here
>>       <https://github.com/apache/iceberg/pull/1849> and is about to
>>       start working on an implementation.
>>       - Agreed to collaborate on the dev list. More eyes would be great.
>>
>>
>> The link to the doc:
>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
>>
>> Thanks,
>> Anton
>>
>
>
> --
> Chen Song
>
>
