Re: Iceberg sync notes - 10 March 2021

Chen Song Tue, 16 Mar 2021 14:33:20 -0700

Thanks Yan. I have a question about sort order support. The spec section at
https://iceberg.apache.org/spec/#sorting describes sort orders, and I found
related PRs such as #1373 <https://github.com/apache/iceberg/pull/1373> and
#1975 <https://github.com/apache/iceberg/pull/1975>. However, it is not clear
to me how this is enforced end to end.


   - Currently, it seems that the sort order info can be persisted in
   manifests. How is this enforced on data files? Is it the writer's
   responsibility to ensure the data is sorted before commit, based on the
   sort order defined at the table level?
   - Assuming data is sorted within each data file, is the Iceberg core
   reader able to return all data (possibly across partitions) in total sorted
   order when reading, based on the sort order information stored in manifests?

Essentially, if we want to support sorting of the underlying data when reading
through the core data API, what are the right and required things to do?
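
To make the second question concrete, here is a minimal sketch of the kind of
merge a reader would need to perform to turn per-file sorted data into one
globally sorted stream. This is plain Python, not Iceberg API:
`read_sorted_file`, the in-memory `files` data and the `id` sort key are all
hypothetical stand-ins for illustration only.

```python
import heapq
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]


def read_sorted_file(rows: List[Row]) -> Iterator[Row]:
    """Hypothetical stand-in for a reader over one data file.

    Assumes the writer already sorted the rows within the file.
    """
    yield from rows


def merge_sorted_files(files: Iterable[List[Row]], key: str) -> Iterator[Row]:
    """Lazily k-way merge several individually sorted files on `key`."""
    streams = [read_sorted_file(f) for f in files]
    # heapq.merge keeps only one pending row per stream in memory.
    return heapq.merge(*streams, key=lambda row: row[key])


# Three "data files", each sorted on its own but overlapping in key range.
files = [
    [{"id": 1}, {"id": 4}, {"id": 9}],
    [{"id": 2}, {"id": 3}, {"id": 10}],
    [{"id": 5}],
]

merged_ids = [row["id"] for row in merge_sorted_files(files, key="id")]
print(merged_ids)
```

In this sketch, per-file ordering alone does not give a total order; some
merge step like the one above (or a sort in the engine) is still needed on
the read side.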

Thanks,
Chen


On Tue, Mar 16, 2021 at 4:05 PM Yan Yan <[email protected]> wrote:

> Hi Chen,
>
> Here is the doc on remaining tasks for format V2 that I updated with the
> latest status today, including individual PRs pending review and tasks
> needed that are V2-blocking:
> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
> Please feel free to comment/edit as needed.
>
> As mentioned in Anton's email, it would be great if more people can review
> the pending PRs.
>
> Thank you!
> Yan
>
>
> On Tue, Mar 16, 2021 at 8:06 AM Chen Song <[email protected]> wrote:
>
>> Thanks for the summary. On the V2 format: is there a Google doc to review,
>> or any sort of backlog of tickets to track?
>>
>> Chen
>>
>> On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi
>> <[email protected]> wrote:
>>
>>> Hey everyone,
>>>
>>> Thanks to folks who attended. I added my notes from the last sync.
>>> Please feel free to add/correct if I missed anything.
>>>
>>> Main points
>>>
>>>    - Highlights
>>>       - StreamingOffset for Structured Streaming in Spark
>>>       - New Actions API
>>>       - Spark procedure for partial import of existing tables
>>>       - Subsurface talks are online
>>>       - Call for papers is open at ApacheCon and Subsurface
>>>    - Releases
>>>       - 0.11.1
>>>          - Waiting for the fix on handling situations when the
>>>          metastore fails during commit (#2317).
>>>       - 0.12.0
>>>          - Should include Spark 3.1 support
>>>          - V2 format items should be included whenever possible but
>>>          should not block the release
>>>          - No new blockers
>>>          - Ideally, end of March
>>>    - Table corruption issue (#2317
>>>    <https://github.com/apache/iceberg/issues/2317>)
>>>       - We may corrupt tables if the metastore fails during commit and
>>>       the commit state is unknown. Iceberg may delete files that were
>>>       actually committed.
>>>       - A lot of folks have seen this issue.
>>>       - Parth has shared some thoughts from a discussion they had
>>>       internally here
>>>       <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>>>       - We can handle this issue in two phases:
>>>          - Don’t corrupt the table (Russell has a PR)
>>>          - Avoid duplicated results if operations are blindly retried
>>>          (can be done in a follow-up PR)
>>>       - Seems worth including the first part in 0.11.1
>>>    - V2 format
>>>       - Open points:
>>>          - Primary key or row id for upserts
>>>          - Propagating the sort order id for files on write
>>>       - Need more reviewers
>>>    - Encryption
>>>       - Multiple people expressed interest in data encryption.
>>>       - Existing work by John here
>>>       <https://github.com/apache/iceberg/pull/1918>.
>>>       - Ideally, should leverage as much as possible of modular
>>>       encryption in Parquet 1.12 discussed here
>>>       <https://github.com/apache/iceberg/issues/1413>.
>>>       - Agreed to start a thread on the dev list.
>>>    - CachingCatalog issues (#2319
>>>    <https://github.com/apache/iceberg/issues/2319>)
>>>       - The current behavior leads to stale data if multiple sessions
>>>       are used.
>>>       - No ideal solution due to Spark limitations. Agreed to discuss
>>>       in the issue.
>>>    - Multi-table transactions
>>>       - Jacques has proposed an API here
>>>       <https://github.com/apache/iceberg/pull/1849> and is about to
>>>       start working on an implementation.
>>>       - Agreed to collaborate on the dev list. More eyes would be great.
>>>
>>>
>>> The link to the doc:
>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
>>>
>>> Thanks,
>>> Anton
>>>
>>
>>
>> --
>> Chen Song
>>
>>

-- 
Chen Song
