Re: Iceberg sync notes - 10 March 2021

Chen Song Tue, 16 Mar 2021 08:06:44 -0700

Thanks for the summary. On the V2 format: is there a Google doc to review, or
any sort of backlog of tickets to track?


Chen

On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> Hey everyone,
>
> Thanks to folks who attended. I added my notes from the last sync. Please
> feel free to add/correct if I missed anything.
>
> Main points
>
>    - Highlights
>       - StreamingOffset for Structured Streaming in Spark
>       - New Actions API
>       - Spark procedure for partial import of existing tables
>       - Subsurface talks are online
>       - Call for papers is open at ApacheCon and Subsurface
>    - Releases
>       - 0.11.1
>          - Waiting for the fix on handling situations when the metastore
>          fails during commit (#2317).
>       - 0.12.0
>          - Should include Spark 3.1 support
>          - V2 format items should be included whenever possible but
>          should not block the release
>          - No new blockers
>          - Ideally, end of March
>    - Table corruption issue (#2317
>    <https://github.com/apache/iceberg/issues/2317>)
>       - We may corrupt tables if the metastore fails during commit and
>       the commit state is unknown. Iceberg may delete files that were actually
>       committed.
>       - A lot of folks have seen this issue.
>       - Parth has shared some thoughts from a discussion they had
>       internally here
>       <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>       - We can handle this issue in two phases:
>          - Don’t corrupt the table (Russell has a PR)
>          - Avoid duplicated results if operations are blindly retried
>          (can be done in a follow-up PR)
>       - Seems worth including the first part in 0.11.1
>    - V2 format
>       - Open points:
>          - Primary key or row id for upserts
>          - Propagating the sort order id for files on write
>       - Need more reviewers
>    - Encryption
>       - Multiple people expressed interest in data encryption.
>       - Existing work by John here
>       <https://github.com/apache/iceberg/pull/1918>.
>       - Ideally, this should leverage as much as possible of the modular
>       encryption in Parquet 1.12, as discussed here
>       <https://github.com/apache/iceberg/issues/1413>.
>       - Agreed to start a thread on the dev list.
>    - CachingCatalog issues (#2319
>    <https://github.com/apache/iceberg/issues/2319>)
>       - The current behavior leads to stale data if multiple sessions are
>       used.
>       - No ideal solution due to Spark limitations. Agreed to discuss in
>       the issue.
>    - Multi-table transactions
>       - Jacques has proposed an API here
>       <https://github.com/apache/iceberg/pull/1849> and is about to start
>       working on an implementation.
>       - Agreed to collaborate on the dev list. More eyes would be great.
>
>
> The link to the doc:
> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
>
> Thanks,
> Anton
>


-- 
Chen Song
