Re: Iceberg sync notes - 10 March 2021

Anton Okolnychyi Tue, 16 Mar 2021 19:31:52 -0700

Yan is absolutely correct. 

We only leverage the sort order during DELETE/UPDATE/MERGE operations in Spark 
for now as we handle the plan construction ourselves. There will be an API in 
Spark 3.2 to request a specific distribution and ordering for normal writes. 
There are also similar efforts in Flink.
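
To illustrate the writer-side responsibility Yan describes, here is a minimal 
Python sketch. It is not Iceberg code: the function names, the row 
representation (a list of dicts) and the (column, ascending) sort-field pairs 
are all hypothetical, and null ordering is ignored. The only fact taken from 
the spec is that sort order id 0 means unsorted. The idea is that a writer 
records the table's sort order id for a file only after checking that the rows 
really are in that order, and otherwise falls back to 0.

```python
from typing import Any, Dict, List, Sequence, Tuple

UNSORTED_ORDER_ID = 0  # the Iceberg spec reserves sort order id 0 for "unsorted"

# Hypothetical representation of a sort order: (column name, ascending?) pairs.
SortFields = Sequence[Tuple[str, bool]]


def _in_order(prev: Dict[str, Any], curr: Dict[str, Any], fields: SortFields) -> bool:
    """Return True if `prev` may come before `curr` under the given sort fields."""
    for column, ascending in fields:
        a, b = prev[column], curr[column]
        if a == b:
            continue  # tie on this field, so compare the next one
        return a < b if ascending else a > b
    return True  # equal on every sort field


def sort_order_id_to_record(
    rows: List[Dict[str, Any]], fields: SortFields, table_order_id: int
) -> int:
    """Pick the sort order id a writer could safely record for one data file.

    Assumption for this sketch: the writer holds the file's rows in memory
    and none of the sort columns contain nulls.
    """
    if all(_in_order(p, c, fields) for p, c in zip(rows, rows[1:])):
        return table_order_id
    return UNSORTED_ORDER_ID


if __name__ == "__main__":
    rows = [{"ts": 1, "id": "b"}, {"ts": 1, "id": "a"}, {"ts": 2, "id": "z"}]
    fields = [("ts", True), ("id", False)]  # ts ascending, then id descending
    print(sort_order_id_to_record(rows, fields, table_order_id=1))
```

A real writer would more likely sort the data itself (or have the engine do 
it) rather than verify it after the fact; the check above only shows why the 
id should not be recorded blindly.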


- Anton

> On 16 Mar 2021, at 17:03, Yan Yan <yyany...@gmail.com> wrote:
> 
> Hi Chen,
> 
> I think sort order support currently exists mostly at the Iceberg spec 
> level. The user can specify a sort order on the table, and ideally the writer 
> should use this information to determine the sort order to apply when writing 
> data, and persist that information to the data files. But at this moment we 
> don't have the integration between engines and the Iceberg library that would 
> allow writers to write any sort order id other than 0 (unsorted, which is the 
> default) for data files. Even if that were possible, I think we are still 
> lacking engine support for sort order in general; there are active efforts in 
> Spark to support sort order in writing, but I'm not sure about the other 
> engines. 
> And yes, it should be the responsibility of the writer to ensure the data is 
> indeed sorted before writing the sort order information to files. As for 
> your second question, I don't think we have this support for now, mostly 
> because the feature is still under development for the reasons mentioned 
> above. 
> 
> Thank you,
> Yan
> 
> 
> On Tue, Mar 16, 2021 at 2:33 PM Chen Song <chen.song...@gmail.com> wrote:
> Thanks Yan. I have a question about sort order support. I saw 
> https://iceberg.apache.org/spec/#sorting, which describes sorting in the 
> spec, and I found related tickets like #1373 
> <https://github.com/apache/iceberg/pull/1373> and #1975 
> <https://github.com/apache/iceberg/pull/1975>. However, it is not clear to me 
> how this is enforced end to end.
> Currently, it seems that the sort order info can be persisted in manifests. 
> On data files, how is this enforced? Is it the writer's responsibility to 
> ensure the data is sorted before commit, based on the sort order defined at 
> the table level?
> Assuming data is sorted within each data file, is the Iceberg core reader 
> able to return all data (possibly across partitions) in total sorted order 
> when reading, based on the sort order information stored in manifests?
> Essentially, if we want to support sorting on the underlying data when 
> reading with the core data API, what are the right and required things to do?
> 
> Thanks,
> Chen
> 
> 
> On Tue, Mar 16, 2021 at 4:05 PM Yan Yan <yyany...@gmail.com> wrote:
> Hi Chen,
> 
> Here is the doc on remaining tasks for format V2, which I updated with the 
> latest status today, including individual PRs pending review and the 
> outstanding tasks that are V2-blocking: 
> https://docs.google.com/document/d/1FyLJyvzcZbfbjwDMEZd6Dj-LYCfrzK1zC-Bkb3OiICc/edit
> Please feel free to comment/edit as needed. 
> 
> As mentioned in Anton's email, it would be great if more people could review 
> the pending PRs.
> 
> Thank you!
> Yan
> 
> 
> On Tue, Mar 16, 2021 at 8:06 AM Chen Song <chen.song...@gmail.com> wrote:
> Thanks for the summary. On the V2 format: is there a Google doc to review, or 
> any sort of backlog of tickets to track?
> 
> Chen
> 
> On Mon, Mar 15, 2021 at 10:34 PM Anton Okolnychyi 
> <aokolnyc...@apple.com.invalid> wrote:
> Hey everyone,
> 
> Thanks to folks who attended. I added my notes from the last sync. Please 
> feel free to add/correct if I missed anything.
> 
> Main points
> - Highlights
>   - StreamingOffset for Structured Streaming in Spark
>   - New Actions API
>   - Spark procedure for partial import of existing tables
>   - Subsurface talks are online
>   - Call for papers is open at ApacheCon and Subsurface
> - Releases
>   - 0.11.1
>     - Waiting for the fix on handling situations when the metastore fails 
>       during commit (#2317).
>   - 0.12.0
>     - Should include Spark 3.1 support
>     - V2 format items should be included whenever possible but should not 
>       block the release
>     - No new blockers
>     - Ideally, end of March
> - Table corruption issue (#2317 
>   <https://github.com/apache/iceberg/issues/2317>)
>   - We may corrupt tables if the metastore fails during commit and the 
>     commit state is unknown. Iceberg may delete files that were actually 
>     committed.
>   - A lot of folks have seen this issue.
>   - Parth has shared some thoughts from a discussion they had internally here 
>     <https://docs.google.com/document/d/1dN7gZwXmlI6Nl4RToAWgsMIsiJUCRSpfFfIL9Kr8s0k>.
>   - We can handle this issue in two phases:
>     - Don’t corrupt the table (Russell has a PR)
>     - Avoid duplicated results if operations are blindly retried (can be 
>       done in a follow-up PR)
>   - Seems worth including the first part in 0.11.1
> - V2 format
>   - Open points:
>     - Primary key or row id for upserts
>     - Propagating the sort order id for files on write
>   - Need more reviewers
> - Encryption
>   - Multiple people expressed interest in data encryption.
>   - Existing work by John here <https://github.com/apache/iceberg/pull/1918>.
>   - Ideally, should leverage as much as possible of modular encryption in 
>     Parquet 1.12 discussed here 
>     <https://github.com/apache/iceberg/issues/1413>.
>   - Agreed to start a thread on the dev list.
> - CachingCatalog issues (#2319 
>   <https://github.com/apache/iceberg/issues/2319>)
>   - The current behavior leads to stale data if multiple sessions are used.
>   - No ideal solution due to Spark limitations. Agreed to discuss in the 
>     issue.
> - Multi-table transactions
>   - Jacques has proposed an API here 
>     <https://github.com/apache/iceberg/pull/1849> and is about to start 
>     working on an implementation.
>   - Agreed to collaborate on the dev list. More eyes would be great.
> 
> The link to the doc: 
> https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg
> 
> Thanks,
> Anton
> 
