Also, we keep historical notes and a running agenda for the next sync in this doc: https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?pli=1#heading=h.z3dncl7gr8m1
Feel free to add topics for the next one, which will be on Wednesday, 21 July 2021 at 16:00 UTC.

On Thu, Jul 1, 2021 at 9:42 AM Carl Steinbach <c...@apache.org> wrote:

Iceberg Community Meetings are open to everyone. To receive an invitation to the next meeting, please join the iceberg-s...@googlegroups.com <https://groups.google.com/g/iceberg-sync> list. Special thanks to Ryan Blue for contributing most of these notes.

Attendees: Anjali Norwood, Badrul Chowdhury, Ben Mears, Dan Weeks, Gustavo Torres Torres, Jack Ye, Karuppayya Rajendran, Kyle Bendickson, Parth Brahmbhatt, Russell Spitzer, Ryan Blue, Sreeram Garlapati, Szehon Ho, Wing Yew Poon, Xinbin Huang, Yan Yan, Carl Steinbach

Highlights
- JDBC catalog was committed (Thanks, Ismail!)
- DynamoDB catalog was committed (Thanks, Jack!)
- Added predicate pushdown for the partitions metadata table (Thanks, Szehon!)

Releases
- 0.12.0
  - New Actions API update: almost done with compaction.
  - Need to deprecate the old Actions API (to confirm).
  - Spark 3.1 support: recently rebased on master (https://github.com/apache/iceberg/pull/2512). No longer adds new modules; should be ready to commit.

Feature-based or time-based release cycle?
- Carl: A time-based release cycle would be more predictable, not slipping because of some feature that isn't quite ready. This could be monthly or quarterly.
- Ryan: We already try not to hold back releases to get features in, because it is better to release more often than to let releases slip. But we could be better about this. It's important to release continuously so that changes get back out to contributors.
- The consensus was to discuss this on the dev list. It is a promising idea.

Iceberg 1.0?
- Carl: Semver is a lie, and there is a public perception around 1.0 releases. Should we go ahead and target a 1.0 soon?
- Ryan: What do you mean that semver is a lie?
- Carl: If semver were followed carefully, most projects would be on a major version in the 100s. Many things change, and the version doesn't always reflect it.
- Ryan: That's fair, but I think people still make downstream decisions based on how those version numbers change.
- Jack: There is an expectation that breaking changes are signaled by increasing the major version, or more accurately, that not increasing the major version indicates no major APIs are broken.
- Ryan: Also, bumping up to 1.0 is when people start expecting more rigid enforcement of semver, even if it isn't always done. If we want to update to 1.0 and/or drop semver, we should figure out our guarantees and document them clearly. We should also prepare for more API stability, maybe by adding binary compatibility checks to the build.
- The consensus was to discuss this more on the dev list and target a 1.0 for later this year, with clear guidelines about API compatibility.

New Slack community: apache-iceberg.slack.com <https://communityinviter.com/apps/apache-iceberg/apache-iceberg-website>
- It's easy to sign up for ASF Slack here: https://s.apache.org/slack-invite
- No need for an independent Iceberg workspace.

Any updates on the secondary index design?
- Miao and Guy weren't at the meeting, so no update.
- Jack is going to look into this and help out.
GitHub triage permissions for project contributors
- Carl opened an INFRA ticket for anyone with 2 or more contributions.
- We will see if Infra can add everyone.
- Ref: INFRA-22026, INFRA-22031

Updating partitioning via Optimize/RewriteDataFiles
- Russell: We ran into an issue where compaction with multiple partition specs will create many small files: planning groups files by the current spec, but writing can split data for the new spec. Since this is a rare event (unmerged data in an old spec), the solution is to merge files for the old spec separately. (A sketch of a RewriteDataFiles call follows these notes.)
- Ryan: Sounds reasonable.

Low-latency streaming
- Sreeram: We are trying to see how frequently we can commit to an Iceberg table, looking to get to commits every 1-2 seconds. One main issue we've found is that there are several metadata files written for every commit: at least one manifest, the manifest list, and the metadata JSON file. Plus, the metadata JSON file has many snapshots and gets quite large (3MB+) after a day of frequent commits. Is there a way to improve how the JSON file tracks snapshots? (A sketch of the existing metadata-retention knobs follows these notes.)
- Ryan: There is space to improve this. I've thought about replacing the JSON file with a database so that changes are more targeted and don't require rewriting all of the information. This is supported by the TableOperations API, which swaps TableMetadata objects. The JSON file isn't really required by the implementation, although it has become popular because it places all of the table metadata in the file system, so the source of truth is entirely in the table's files.
- Sreeram: What about writing diffs of the JSON files? We could, for example, write a new snapshot as the only content in a new JSON file.
- Ryan: You could come up with a way to do that, but what you'd want to avoid is needing to read lots of files to reconstruct the table's current state. If you're trying to put together the history or snapshots metadata tables, you don't want to read the current file, its parent, that file's parent, and so on. (That's an easy design flaw to fall into.) What you should do instead is choose a base version and write all differences against that. We'd need to define the format for JSON diffs.
- Ryan: I also think it may be more useful to replace the JSON file with a database, because the diff approach could introduce more commit conflicts: when the JSON file is periodically rewritten entirely to produce a new base version, that operation may fail due to faster commits from other writers. That would be bad for a table.
- Ryan: What is the use case for this? 1-2 seconds is very frequent and causes other issues, like small data files that need to be compacted, plus compaction commit retries because of frequent, ongoing commits.
- Sreeram: The idea is to see if we can replace Kafka with an Iceberg table in workflows.
- Ryan: I don't think that's something you'd want to do. Iceberg just isn't designed for that kind of use case, and that is what Kafka does really well.
- Kyle: Yeah, you'd definitely want to use Kafka for that. Iceberg is good for long-term storage and isn't a good replacement.

Purge Behaviors
- Russell: Spark's new API passes a purge flag through DROP TABLE. Do we want to respect that flag? (A short DROP TABLE PURGE example follows these notes.)
- Ryan: Yes? Why wouldn't we?
- Russell: Not everyone wants to purge data.
- Ryan: Agreed. Netflix wouldn't do this because they often have to restore tables. But that's something that Netflix can turn off in their catalog. For the built-in catalogs, we should probably support the expected behavior.

Deduplication? As part of rewrite
- Kyle [I think]: What is the story around deduplication? Duplicate records are a common problem.
- Ryan: Iceberg didn't have one before, but now that we have a way to identify records, thanks to Jack adding the row identifier fields, we could build something in this space. Maybe a background service that detects duplicates and rewrites? But we would want to be careful here, because it could easily end up reading an entire table if the partition spec is not aligned with the identifier fields. (A sketch of declaring identifier fields follows these notes.)

--
Ryan Blue
Tabular
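
A minimal sketch of the RewriteDataFiles call discussed under "Updating partitioning via Optimize/RewriteDataFiles", assuming the new SparkActions API shape that 0.12 is expected to ship. The SparkSession, table, filter column, and option value are illustrative placeholders, not details from the meeting:

    import org.apache.iceberg.Table;
    import org.apache.iceberg.actions.RewriteDataFiles;
    import org.apache.iceberg.expressions.Expressions;
    import org.apache.iceberg.spark.actions.SparkActions;
    import org.apache.spark.sql.SparkSession;

    public class CompactExample {
      // Compact small files in a subset of an already-loaded Iceberg table.
      static void compact(SparkSession spark, Table table) {
        RewriteDataFiles.Result result =
            SparkActions.get(spark)
                .rewriteDataFiles(table)
                // Filters apply to column values; "event_date" is a placeholder column.
                .filter(Expressions.equal("event_date", "2021-07-01"))
                // Example target size; the default comes from the table's write properties.
                .option("target-file-size-bytes", String.valueOf(512L * 1024 * 1024))
                .execute();

        System.out.println("Rewrote " + result.rewrittenDataFilesCount()
            + " files, added " + result.addedDataFilesCount());
      }
    }

A rewrite planned this way still groups files by the current spec; per the discussion above, data left in an old spec may need to be compacted in a separate pass.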
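
On the metadata-growth point in the low-latency streaming discussion, a sketch of the retention knobs Iceberg already has. These bound how much snapshot and metadata-file history is kept; they do not reduce the number of files written per commit. The retention values below are arbitrary examples:

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    public class TrimMetadataHistory {
      static void trim(Table table) {
        // Delete older metadata.json versions after each commit instead of keeping them all.
        table.updateProperties()
            .set("write.metadata.delete-after-commit.enabled", "true")
            .set("write.metadata.previous-versions-max", "10")
            .commit();

        // Expire old snapshots so the snapshot list in the current metadata file stays small.
        long olderThanMillis = System.currentTimeMillis() - TimeUnit.HOURS.toMillis(1);
        table.expireSnapshots()
            .expireOlderThan(olderThanMillis)
            .retainLast(100) // keep a bounded number of recent snapshots
            .commit();
      }
    }

Expiring snapshots this aggressively also limits time travel, so the retention window has to match how the table is used.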
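
For context on the purge discussion, the flag comes from Spark SQL's DROP TABLE syntax; a tiny illustration with a made-up table identifier:

    import org.apache.spark.sql.SparkSession;

    public class DropTableExamples {
      static void drop(SparkSession spark) {
        // Drops the table from the catalog; what happens to the data is up to the catalog.
        spark.sql("DROP TABLE IF EXISTS demo.db.events");

        // PURGE asks the catalog to also delete the table's data and metadata files,
        // which is exactly the behavior being debated above.
        spark.sql("DROP TABLE IF EXISTS demo.db.events PURGE");
      }
    }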
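
On the deduplication question, the row identifier fields Ryan mentions can be declared on a table's schema, and that declaration is what a dedup or rewrite service would build on. A sketch assuming the UpdateSchema support on master, with an illustrative column name (identifier fields must already exist in the schema as required, non-null fields):

    import org.apache.iceberg.Table;

    public class DeclareRowIdentifier {
      static void declare(Table table) {
        // Record which column uniquely identifies a row in this table.
        table.updateSchema()
            .setIdentifierFields("order_id")
            .commit();
      }
    }

The duplicate-detection and rewrite service itself is still only an idea, as the notes say.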