Also, we keep historical notes and a running agenda for the next sync in this doc: https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?pli=1#heading=h.z3dncl7gr8m1
Feel free to add topics for the next one, which will be on Wednesday, 21 July 2021 at 16:00 UTC.

On Thu, Jul 1, 2021 at 9:42 AM Carl Steinbach <c...@apache.org> wrote:

Iceberg Community Meetings are open to everyone. To receive an invitation to the next meeting, please join the iceberg-s...@googlegroups.com <https://groups.google.com/g/iceberg-sync> list. Special thanks to Ryan Blue for contributing most of these notes.

Attendees: Anjali Norwood, Badrul Chowdhury, Ben Mears, Dan Weeks, Gustavo Torres Torres, Jack Ye, Karuppayya Rajendran, Kyle Bendickson, Parth Brahmbhatt, Russell Spitzer, Ryan Blue, Sreeram Garlapati, Szehon Ho, Wing Yew Poon, Xinbin Huang, Yan Yan, Carl Steinbach

Highlights
- JDBC catalog was committed (Thanks, Ismail!)
- DynamoDB catalog was committed (Thanks, Jack!)
- Added predicate pushdown for the partitions metadata table (Thanks, Szehon!)

Releases
- 0.12.0
  - New Actions API update: almost done with compaction.
  - Need to deprecate the old Actions API (to confirm).
  - Spark 3.1 support: recently rebased on master (https://github.com/apache/iceberg/pull/2512). No longer adds new modules; should be ready to commit.

Feature-based or time-based release cycle?
- Carl: A time-based release cycle would be more predictable, not slipping because of some feature that isn't quite ready. This could be monthly or quarterly.
- Ryan: We already try not to hold back releases to get features in, because it is better to release more often than to let releases slip. But we could be better about this. It's important to release continuously so that changes get back out to contributors.
- The consensus was to discuss this on the dev list. It is a promising idea.

Iceberg 1.0?
- Carl: Semver is a lie, and there is a public perception around 1.0 releases. Should we go ahead and target a 1.0 soon?
- Ryan: What do you mean that semver is a lie?
- Carl: If semver were followed carefully, most projects would be on a major version in the 100s. Many things change, and the version doesn't always reflect it.
- Ryan: That's fair, but I think people still make downstream decisions based on how those version numbers change.
- Jack: There is an expectation that breaking changes are signaled by increasing the major version, or more accurately, that not increasing the major version indicates no major APIs are broken.
- Ryan: Also, bumping up to 1.0 is when people start expecting more rigid enforcement of semver, even if it isn't always done. If we want to update to 1.0 and/or drop semver, we should figure out our guarantees and document them clearly. We should also prepare for more API stability, maybe by adding binary compatibility checks to the build.
- The consensus was to discuss this more on the dev list and target a 1.0 for later this year, with clear guidelines about API compatibility.

New Slack community: apache-iceberg.slack.com <https://communityinviter.com/apps/apache-iceberg/apache-iceberg-website>
- It's easy to sign up for ASF Slack here: https://s.apache.org/slack-invite
- No need for an independent Iceberg workspace.

Any updates on the secondary index design?
- Miao and Guy weren't at the meeting, so no update.
- Jack is going to look into this and help out.
GitHub triage permissions for project contributors
- Carl opened an INFRA ticket for anyone with 2 or more contributions.
- We will see if Infra can add everyone.
- Ref: INFRA-22026, INFRA-22031

Updating partitioning via Optimize/RewriteDataFiles
- Russell: We ran into an issue where compaction with multiple partition specs will create many small files: planning groups files by the current spec, but writing can split data for the new spec. Since this is a rare event (unmerged data in an old spec), the solution is to merge files for the old spec separately. (A sketch of a RewriteDataFiles call follows these notes.)
- Ryan: Sounds reasonable.

Low-latency streaming
- Sreeram: We are trying to see how frequently we can commit to an Iceberg table, looking to get to commits every 1-2 seconds. One main issue we've found is that there are several metadata files written for every commit: at least one manifest, the manifest list, and the metadata JSON file. Plus, the metadata JSON file has many snapshots and gets quite large (3MB+) after a day of frequent commits. Is there a way to improve how the JSON file tracks snapshots? (A sketch of the existing metadata-retention knobs follows these notes.)
- Ryan: There is space to improve this. I've thought about replacing the JSON file with a database so that changes are more targeted and don't require rewriting all of the information. This is supported by the TableOperations API, which swaps TableMetadata objects. The JSON file isn't really required by the implementation, although it has become popular because it places all of the table metadata in the file system, so the source of truth is entirely in the table's files.
- Sreeram: What about writing diffs of the JSON files? We could, for example, write a new snapshot as the only content in a new JSON file.
- Ryan: You could come up with a way to do that, but what you'd want to avoid is needing to read lots of files to reconstruct the table's current state. If you're trying to put together the history or snapshots metadata tables, you don't want to read the current file, its parent, that file's parent, and so on. (That's an easy design flaw to fall into.) What you should do instead is choose a base version and write all differences against that. We'd need to define the format for JSON diffs.
- Ryan: I also think it may be more useful to replace the JSON file with a database, because the diff approach could introduce more commit conflicts: when the JSON file is periodically rewritten entirely to produce a new base version, that operation may fail due to faster commits from other writers. That would be bad for a table.
- Ryan: What is the use case for this? 1-2 seconds is very frequent and causes other issues, like small data files that need to be compacted, plus compaction commit retries because of frequent, ongoing commits.
- Sreeram: The idea is to see if we can replace Kafka with an Iceberg table in workflows.
- Ryan: I don't think that's something you'd want to do. Iceberg just isn't designed for that kind of use case, and that is what Kafka does really well.
- Kyle: Yeah, you'd definitely want to use Kafka for that. Iceberg is good for long-term storage and isn't a good replacement.

Purge Behaviors
- Russell: Spark's new API passes a purge flag through DROP TABLE. Do we want to respect that flag? (A short DROP TABLE PURGE example follows these notes.)
- Ryan: Yes? Why wouldn't we?
- Russell: Not everyone wants to purge data.
- Ryan: Agreed. Netflix wouldn't do this because they often have to restore tables. But that's something that Netflix can turn off in their catalog. For the built-in catalogs, we should probably support the expected behavior.

Deduplication? As part of rewrite
- Kyle [I think]: What is the story around deduplication? Duplicate records are a common problem.
- Ryan: Iceberg didn't have one before, but now that we have a way to identify records, thanks to Jack adding the row identifier fields, we could build something in this space. Maybe a background service that detects duplicates and rewrites? But we would want to be careful here, because it could easily end up reading an entire table if the partition spec is not aligned with the identifier fields. (A sketch of declaring identifier fields follows these notes.)

--
Ryan Blue
Tabular
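
A minimal sketch of the RewriteDataFiles call discussed under "Updating partitioning via Optimize/RewriteDataFiles", assuming the new SparkActions API shape that 0.12 is expected to ship. The SparkSession, table, filter column, and option value are illustrative placeholders, not details from the meeting:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class CompactExample {
      // Compact small files in a subset of an already-loaded Iceberg table.
      static void compact(SparkSession spark, Table table) {
        RewriteDataFiles.Result result =
            SparkActions.get(spark)
                .rewriteDataFiles(table)
                // Filters apply to column values; "event_date" is a placeholder column.
                .filter(Expressions.equal("event_date", "2021-07-01"))
                // Example target size; the default comes from the table's write properties.
                .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
                .execute();

        System.out.println("Rewrote " + result.rewrittenDataFilesCount()
            + " files, added " + result.addedDataFilesCount());
      }
    }

A rewrite planned this way still groups files by the current spec; per the discussion above, data left in an old spec may need to be compacted in a separate pass.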
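
On the metadata-growth point in the low-latency streaming discussion, a sketch of the retention knobs Iceberg already has. These bound how much snapshot and metadata-file history is kept; they do not reduce the number of files written per commit. The retention values below are arbitrary examples:

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    public class TrimMetadataHistory {
      static void trim(Table table) {
        // Delete older metadata.json versions after each commit instead of keeping them all.
        table.updateProperties()
            .set("write.metadata.delete-after-commit.enabled", "true")
            .set("write.metadata.previous-versions-max", "10")
            .commit();

        // Expire old snapshots so the snapshot list in the current metadata file stays small.
        long olderThanMillis = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(1);
        table.expireSnapshots()
            .expireOlderThan(olderThanMillis)
            .retainLast(100) // keep a bounded number of recent snapshots
            .commit();
      }
    }

Expiring snapshots this aggressively also limits time travel, so the retention window has to match how the table is used.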
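
For context on the purge discussion, the flag comes from Spark SQL's DROP TABLE syntax; a tiny illustration with a made-up table identifier:

    import org.apache.spark.sql.SparkSession;

    public class DropTableExamples {
      static void drop(SparkSession spark) {
        // Drops the table from the catalog; what happens to the data is up to the catalog.
        spark.sql("DROP TABLE IF EXISTS demo.db.events");

        // PURGE asks the catalog to also delete the table's data and metadata files,
        // which is exactly the behavior being debated above.
        spark.sql("DROP TABLE IF EXISTS demo.db.events PURGE");
      }
    }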
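
On the deduplication question, the row identifier fields Ryan mentions can be declared on a table's schema, and that declaration is what a dedup or rewrite service would build on. A sketch assuming the UpdateSchema support on master, with an illustrative column name (identifier fields must already exist in the schema as required, non-null fields):

    import org.apache.iceberg.Table;

    public class DeclareRowIdentifier {
      static void declare(Table table) {
        // Record which column uniquely identifies a row in this table.
        table.updateSchema()
            .setIdentifierFields("order_id")
            .commit();
      }
    }

The duplicate-detection and rewrite service itself is still only an idea, as the notes say.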