Re: Iceberg sync notes - 10 March 2021

Chen Song Mon, 22 Mar 2021 08:47:06 -0700

Thanks for the clarification. Is
https://iceberg.apache.org/spec/#iceberg-table-spec the official doc for the
V2 spec? The doc at
https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
is a breakdown of tasks, not the spec itself.



On Tue, Mar 16, 2021 at 10:31 PM Anton Okolnychyi
<aokolnyc...@apple.com.invalid> wrote:

> Yan is absolutely correct.
>
> We only leverage the sort order during DELETE/UPDATE/MERGE operations in
> Spark for now as we handle the plan construction ourselves. There will be
> an API in Spark 3.2 to request a specific distribution and ordering for
> normal writes. There are also similar efforts in Flink.
>
> - Anton
>
> On 16 Mar 2021, at 17:03, Yan Yan <yyany...@gmail.com> wrote:
>
> Hi Chen,
>
> I think sort order support currently exists mostly at the Iceberg spec
> level. The user can specify a sort order on the table, and ideally the
> writer should use this information to determine the sort order to apply
> when writing data, and persist it to the data files. But at the moment we
> don't have integration between the engines and the Iceberg library that
> would let writers record anything other than 0 (unsorted, which is the
> default) for data files. Even if that were possible, I think we are still
> lacking engine support for sort order in general; there are active efforts
> in Spark to support sort order on write, but I'm not sure about the other
> engines. And yes, it should be the writer's responsibility to ensure the
> data is actually sorted before writing the sort order information to files.
> As for your second question, I don't think we have this support for now,
> mostly because the feature is still under development, for the reasons
> mentioned above.
>
> Thank you,
> Yan
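
A minimal sketch of the writer-side rule described above, not tied to any actual Iceberg API: a data file is tagged with the table's sort order id only after the writer has checked that its rows really are in order, and falls back to 0 (unsorted) otherwise. The function names and the `key` callback are hypothetical and only for illustration.

```python
from typing import Any, Callable, Iterable

# Per the thread: 0 means "unsorted" and is the default for data files.
UNSORTED_ORDER_ID = 0


def is_sorted(rows: Iterable[Any], key: Callable[[Any], Any]) -> bool:
    """Return True if rows are in non-decreasing order of key(row)."""
    it = iter(rows)
    try:
        prev = key(next(it))
    except StopIteration:
        return True  # an empty file is trivially sorted
    for row in it:
        cur = key(row)
        if cur < prev:
            return False
        prev = cur
    return True


def sort_order_id_for_file(
    rows: Iterable[Any],
    key: Callable[[Any], Any],
    table_sort_order_id: int,
) -> int:
    """Pick the sort order id a writer may record for one data file.

    Hypothetical helper: the table's id is recorded only when the rows
    were verified to be sorted; otherwise the file is marked unsorted.
    """
    if is_sorted(rows, key):
        return table_sort_order_id
    return UNSORTED_ORDER_ID
```

The point of the fallback is that a reader can trust a non-zero id on a file only if the writer never records it for data it has not verified.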
>
>
> On Tue, Mar 16, 2021 at 2:33 PM Chen Song <chen.song...@gmail.com> wrote:
>
>> Thanks Yan. I have a question about sort order support. I saw
>> https://iceberg.apache.org/spec/#sorting talking about support on
>> sorting. And I found related tickets like #1373
>> <https://github.com/apache/iceberg/pull/1373> and #1975
>> <https://github.com/apache/iceberg/pull/1975>. However, it is not clear
>> to me how this is enforced end to end.
>>
>>    - Currently, it seems that the sort order info can be persisted in
>>    manifests. On data files, how is this enforced? Is it the writer's
>>    responsibility to ensure the data is sorted before commit, based on the
>>    sort order info defined at the table level?
>>    - Assuming data is sorted within each data file, is the Iceberg core
>>    reader able to return all data (possibly across partitions) in total
>>    sorted order when reading, based on the sort order information stored
>>    in manifests?
>>
>> Essentially, if we want to support sorting on the underlying data when
>> reading through the core data API, what needs to be done?
>>
>> Thanks,
>> Chen
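
For the second question, a rough sketch of what a caller could do on its own if every data file is already sorted by the same key: a k-way merge over the per-file row streams yields a total order without re-sorting everything. This is plain Python using `heapq.merge` from the standard library, not an Iceberg reader feature; `merge_sorted_files` and its arguments are hypothetical.

```python
import heapq
from typing import Any, Callable, Iterable, Iterator


def merge_sorted_files(
    files: Iterable[Iterable[Any]],
    key: Callable[[Any], Any],
) -> Iterator[Any]:
    """Lazily merge rows from several individually sorted files.

    Assumes every input iterable is already sorted by `key`; if one is
    not, the output order is not guaranteed.
    """
    return heapq.merge(*files, key=key)


if __name__ == "__main__":
    # Two "files", each sorted by id, possibly from different partitions.
    file_a = [{"id": 1}, {"id": 4}]
    file_b = [{"id": 2}, {"id": 3}]
    for row in merge_sorted_files([file_a, file_b], key=lambda r: r["id"]):
        print(row)
```

The merge keeps only one pending row per file in memory, which is why it depends on each file being correctly sorted in the first place.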
>>
>>
>> On Tue, Mar 16, 2021 at 4:05 PM Yan Yan <yyany...@gmail.com> wrote:
>>
>>> Hi Chen,
>>>
>>> Here is the doc on remaining tasks for format V2 that I updated with the
>>> latest status today, including individual PRs pending review and tasks
>>> needed that are V2-blocking:
>>> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
>>> Please feel free to comment/edit as needed.
>>>
>>> As mentioned in Anton's email, it would be great if more people can
>>> review the pending PRs.
>>>
>>> Thank you!
>>> Yan
>>>
>>>
>>> On Tue, Mar 16, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com>
>>> wrote:
>>>
>>>> Thanks for the summary. On the V2 format: is there a Google doc to
>>>> review, or any sort of backlog of tickets to track?
>>>>
>>>> Chen
>>>>
>>>> On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi <
>>>> aokolnyc...@apple.com.invalid> wrote:
>>>>
>>>>> Hey everyone,
>>>>>
>>>>> Thanks to folks who attended. I added my notes from the last sync.
>>>>> Please feel free to add/correct if I missed anything.
>>>>>
>>>>> Main points
>>>>>
>>>>>    - Highlights
>>>>>       - StreamingOffset for Structured Streaming in Spark
>>>>>       - New Actions API
>>>>>       - Spark procedure for partial import of existing tables
>>>>>       - Subsurface talks are online
>>>>>       - Call for papers is open at ApacheCon and Subsurface
>>>>>    - Releases
>>>>>       - 0.11.1
>>>>>          - Waiting for the fix on handling situations when the
>>>>>          metastore fails during commit (#2317).
>>>>>       - 0.12.0
>>>>>          - Should include Spark 3.1 support
>>>>>          - V2 format items should be included whenever possible but
>>>>>          should not block the release
>>>>>          - No new blockers
>>>>>          - Ideally, end of March
>>>>>    - Table corruption issue (#2317
>>>>>    <https://github.com/apache/iceberg/issues/2317>)
>>>>>       - We may corrupt tables if the metastore fails during commit
>>>>>       and the commit state is unknown. Iceberg may delete files that were
>>>>>       actually committed.
>>>>>       - A lot of folks have seen this issue.
>>>>>       - Parth has shared some thoughts from a discussion they had
>>>>>       internally here
>>>>>       <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>>>>>       - We can handle this issue in two phases:
>>>>>          - Don’t corrupt the table (Russell has a PR)
>>>>>          - Avoid duplicated results if operations are blindly retried
>>>>>          (can be done in a follow-up PR)
>>>>>       - Seems worth including the first part in 0.11.1
>>>>>    - V2 format
>>>>>       - Open points:
>>>>>          - Primary key or row id for upserts
>>>>>          - Propagating the sort order id for files on write
>>>>>       - Need more reviewers
>>>>>    - Encryption
>>>>>       - Multiple people expressed interest in data encryption.
>>>>>       - Existing work by John here
>>>>>       <https://github.com/apache/iceberg/pull/1918>.
>>>>>       - Ideally, should leverage as much as possible of modular
>>>>>       encryption in Parquet 1.12 discussed here
>>>>>       <https://github.com/apache/iceberg/issues/1413>.
>>>>>       - Agreed to start a thread on the dev list.
>>>>>    - CachingCatalog issues (#2319
>>>>>    <https://github.com/apache/iceberg/issues/2319>)
>>>>>       - The current behavior leads to stale data if multiple sessions
>>>>>       are used.
>>>>>       - No ideal solution due to Spark limitations. Agreed to discuss
>>>>>       in the issue.
>>>>>    - Multi-table transactions
>>>>>       - Jacques has proposed an API here
>>>>>       <https://github.com/apache/iceberg/pull/1849> and is about to
>>>>>       start working on an implementation.
>>>>>       - Agreed to collaborate on the dev list. More eyes would be
>>>>>       great.
>>>>>
>>>>>
>>>>> The link to the doc:
>>>>> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
>>>>>
>>>>> Thanks,
>>>>> Anton
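
To illustrate the first phase of the table corruption item in the notes above ("don't corrupt the table"), here is a small sketch of the cleanup rule: files written for a commit are deleted only when the commit is known to have failed, and are left alone when the metastore error leaves the commit state unknown. This is not the code from the PR mentioned in the notes; the exception classes and `commit_and_clean_up` are hypothetical names used only for this example.

```python
from typing import Callable, Iterable


class CommitFailedError(Exception):
    """Hypothetical: the commit was definitely rejected."""


class CommitStateUnknownError(Exception):
    """Hypothetical: the metastore failed and the outcome is not known."""


def commit_and_clean_up(
    commit: Callable[[], None],
    new_files: Iterable[str],
    delete_file: Callable[[str], None],
) -> str:
    """Run a commit and clean up its files only when that is safe.

    Returns "committed", "failed", or "unknown".
    """
    try:
        commit()
    except CommitFailedError:
        # The table does not reference these files, so removing them is safe.
        for path in new_files:
            delete_file(path)
        return "failed"
    except CommitStateUnknownError:
        # The commit may have landed; deleting the files could corrupt the
        # table, so keep them and let the caller decide what to do next.
        return "unknown"
    return "committed"
```

Leaving orphan files behind in the unknown case costs storage until they are cleaned up, but it avoids deleting data that a successful commit already points to.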
>>>>>
>>>>
>>>>
>>>> --
>>>> Chen Song
>>>>
>>>>
>>
>> --
>> Chen Song
>>
>>
>

-- 
Chen Song
