Meeting Minutes from 01/19 Iceberg Sync

Sam Redai Wed, 19 Jan 2022 12:28:09 -0800

Hello Iceberg Community,

Below you can find the minutes and video recording from our Iceberg Sync
that took place on *January 19th, 9am-10am PT*.

Always remember, anyone can join the discussion, so feel free to share the
Iceberg-Sync <https://groups.google.com/g/iceberg-sync> Google group with
anyone who is seeking an invite. The notes and the agenda are posted in the
live doc
<https://docs.google.com/document/d/1YuGhUdukLP5gGiqCbk0A5_Wifqe2CZWgOd3TbhY3UQg/edit?usp=drive_web>
that's also attached to the meeting invitation. It's a good place to add
items as you see fit so we can discuss them in the next community sync.

Minutes:

Meeting Recording ⭕
<https://drive.google.com/file/d/1adbE5ichvfM3_-v2mcH16k1NMiCNjW1O/view>

Top of the Meeting Highlights

- Spark 3.2 merge-on-read DELETE is in! (Thanks, Anton!)
- Added dynamic file filter metrics in Spark 3.0 and 3.1 (Thanks, Chen!)
  - Iceberg-spark metrics are now visible in the Spark UI:
    - Candidate files that were scanned
    - Matching files that went on to subsequent stages, such as merge or
      update plans
  - This is worth checking out as an example of how to easily add metrics
    to some of our custom plans and have them show up in the Spark UI
- Added initial OpenAPI spec for REST catalogs (Thanks, Kyle!)
  - The spec for the REST catalog has been shared and lots of community
    members have been looking closely at it.
  - This will have a big impact, particularly on SaaS implementations, so
    the more feedback from the community on this the better.

0.13.0 Release Candidate

- Spark 3.2 merge-on-read DELETE is in! (Thanks, Anton!)
- Sort order in relation to copy-on-write merge might have a few remaining
  items, but it feels ready to merge now and can be improved in upcoming
  releases if necessary.
- Parquet support for the older 2-level list style: getting this into the
  0.13.0 release would be great, although it’s not a blocker and we can
  follow up with a quick subsequent release. PR #3774
  <https://github.com/apache/iceberg/pull/3774>
- Checksum validation on S3 requests is another relevant open pull
  request. This is also not a blocker for 0.13.0 but is close enough that
  we may want to include it. PR #3813
  <https://github.com/apache/iceberg/pull/3813>
- Nessie-related changes for the 0.13.0 release are merged and ready.

Python

- There is a lot of interest in the community and among our user base
- Some ongoing discussions on how to shorten review cycles
- The Types PR was merged yesterday and a couple of current PRs are very
  near merge-ready
- Open invitation to anyone interested in participating in the Python
  refactoring efforts (either contributing or reviewing): check out the
  Iceberg Python Sync <https://groups.google.com/g/iceberg-python-sync> if
  you’re interested!

Java 1.0 API

- Managing delete files is a big request for the 1.0 API. We’ve recently
  added a delete file threshold to the rewrite files action, which drives
  how many delete files will remain. You can also set this to 0 to rewrite
  all delete files. This is enough to remove delete compaction from the
  list of remaining high-priority items.
- Other potential remaining items to consider
  - Expiring delete files that are no longer used by the current
    snapshots, for a boost to storage efficiency. This functionality could
    simply be added to the expire snapshot action.
- No other high-priority maintenance operations seem to be remaining for
  the 1.0 release
- More discussion is needed around what should be public/private and how
  we’ll evolve the API over time.
- Currently, the target is for the next Java API release to be 1.0
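
As a rough illustration of the delete file threshold mentioned above, the
following sketch models the selection in plain Python. It is not the rewrite
action's code: the `DataFile` class, the `select_for_rewrite` helper, and the
"at least this many delete files" comparison are assumptions made for the
example.

```python
from dataclasses import dataclass


@dataclass
class DataFile:
    path: str               # hypothetical file path
    delete_file_count: int  # delete files that apply to this data file


def select_for_rewrite(files, delete_file_threshold):
    """Pick data files with at least `delete_file_threshold` delete files."""
    if delete_file_threshold < 0:
        raise ValueError("threshold must not be negative")
    return [f for f in files if f.delete_file_count >= delete_file_threshold]


files = [
    DataFile("a.parquet", 0),
    DataFile("b.parquet", 2),
    DataFile("c.parquet", 5),
]

# Only files carrying 3 or more delete files are selected.
print([f.path for f in select_for_rewrite(files, 3)])
# A threshold of 0 selects every data file.
print([f.path for f in select_for_rewrite(files, 0)])
```

With a threshold of 0 every data file qualifies, which in this model matches
the note that 0 rewrites all delete files.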

Other High Priority Items:

Alternative File Formats

Ashish has been taking a look at this and it seems very doable. Current
formats (ORC, Parquet, etc.) share very similar interfaces that can inform
the abstraction.

Encryption

This is a high-priority item that’s in demand and can hopefully get in
soon. There are a few PRs open and community review is welcome.

CDC

Not necessarily high priority, but there’s a strong desire to get this in
this calendar year. Refreshing materialized views in Spark is highly
dependent on this functionality. Yufei is working on a design doc.

Tagging Snapshots and Searching Snapshots by Tag

This is useful for exposing tags to users so they can easily retrieve a
previous table state. For example, snapshots could be tagged on an hourly
or daily basis for convenient lookups later.

Z-Ordering

- Significant progress has been made on Z-Ordering and one of the current
  discussions is around its implementation with respect to the magnitude
  problem. Specifically, normalizing values may require metrics/stats on
  column distributions.
- In its current state, Apple’s implementation doesn’t include any stats
  on distribution, but it is valuable as an initial implementation to get
  in: it unblocks the feature and gets it out to the community, while the
  magnitude problem can be solved in a subsequent release.
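
The magnitude problem can be illustrated with a toy sketch. This is not
Apple's implementation or any Iceberg code; it assumes two non-negative 8-bit
integer columns and hypothetical helper names (`interleave_bits`,
`normalize`), and it assumes each column's min and max are already known,
which is the kind of distribution stat the discussion refers to.

```python
def interleave_bits(x, y, bits=8):
    """Interleave the low `bits` bits of x and y into one Z-order value."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i + 1)  # x takes the odd bit positions
        z |= ((y >> i) & 1) << (2 * i)      # y takes the even bit positions
    return z


def normalize(value, lo, hi, bits=8):
    """Rescale value from [lo, hi] onto the full 0..2^bits-1 range."""
    return (value - lo) * ((1 << bits) - 1) // (hi - lo)


# Column a only takes values 0..3, while column b spans 0..255.
# Without normalization, bits 4 and up of the Z-value come only from b,
# so sorting by Z-value is dominated by b.
dominated = all(
    interleave_bits(a, b) >> 4 == interleave_bits(0, b) >> 4
    for a in range(4)
    for b in range(256)
)

# After rescaling a onto 0..255, a can reach the top bit of the Z-value.
scaled = interleave_bits(normalize(2, 0, 3), 0)

print(dominated, scaled >> 15)
```

Rescaling needs the range of each column, which is why normalization may
depend on column stats.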

REST Catalog

- There’s a lot of interest in the REST catalog and there are at least a
  few areas of the community that are ready to use it immediately, namely
  Nessie and Apple
- Future talks about pushing more work into the server implementations:
  - Should planning be a part of the REST catalog API?
  - Pagination mechanism for accessing all of the snapshots in a table?

Relative Paths (design doc <https://docs.google.com/document/d/1RDEjJAVEXg1csRzyzTuM634L88vvI0iDHNQQK3kOVR0/edit#heading=h.hxmtkjthp8hm>)

- High priority for a few community members, primarily Apple.
- Many use cases for this feature across the community (surrounding data
  recovery)

Views

- LinkedIn and Netflix are interested in this effort and in picking it
  back up
- A question is how we’d be able to integrate this back into Spark after
  it’s added on the Iceberg side

Secondary Indexes

- A priority for the Athena team, but if other members also see this as a
  high priority, please reach out and share details around your use cases.
- Might be useful for cases where you can’t order the data as it’s written
  to the table, and for cases where you need to index by an additional
  column that’s not included in your sort order.

Handling Wide Tables (many columns)

- In such a scenario, the metadata files can get very large given that we
  store metrics for each column. The MetricsConfig is key for optimizing
  this.
- See table configuration <https://iceberg.apache.org/configuration/> for
  more details
  - In particular, setting `write.metadata.metrics.default` to none or
    tuning this at the column level using
    `write.metadata.metrics.column.col1`
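
A minimal sketch of what that tuning could look like, written as a small
Python helper that renders a Spark SQL `ALTER TABLE ... SET TBLPROPERTIES`
statement. The two property keys come from the notes above; the table name
`db.wide_table`, the column names, and the helper functions are hypothetical
placeholders, and the chosen modes are only an example.

```python
def metrics_properties(default_mode, per_column=None):
    """Return Iceberg table properties for column-metrics collection."""
    props = {"write.metadata.metrics.default": default_mode}
    for column, mode in (per_column or {}).items():
        props[f"write.metadata.metrics.column.{column}"] = mode
    return props


def alter_table_sql(table, props):
    """Render the properties as a Spark SQL ALTER TABLE statement."""
    pairs = ",\n  ".join(f"'{k}' = '{v}'" for k, v in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n  {pairs}\n)"


# Turn metrics off by default, then opt in for the few columns that are
# actually used for filtering (hypothetical column names).
props = metrics_properties("none", {"col1": "full", "event_ts": "truncate(16)"})
statement = alter_table_sql("db.wide_table", props)
print(statement)
```

Defaulting to none and opting in per column keeps metadata small for wide
tables while preserving stats on the columns that queries filter on.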

Thank you all for another great meeting!
