Hi folks, I’d like to request some feedback on how to use Iceberg to approach a use case we have, that I believe some other folks could be facing, since this was a pattern usually followed with Hive tables.
Use case: 1. We used to have database table snapshots exported daily at 0 UTC. Each day a new partition is created with a materialized snapshot (e.g. ds=2021-02-01, ds=2021-02-02, ...) 2. We have a lot of queries written against this legacy structure. 3. We would like to start migrating to Iceberg by writing a table snapshot and periodically committing mutations (e.g. every half hour). 4. We are trying to retain the legacy interface (`ds` partition as a snapshot) to support the myriad of existing queries, which sometimes target multiple snapshots at the same time so that old queries continue to work, while new queries are written directly against Iceberg tables using time travel. Issues: An issue I see moving this use case to Iceberg is on the interface, as many users already have queries using the `ds` partitioning column to use the snapshot - also note that in this approach users NEED to know specifically that they can only query these tables with a `ds` filter, otherwise they could get duplicated rows. One thought we had to solve this was to use a thin wrapper, for instance in Hive a custom table InputFormat that takes the filter expression (with the `ds`) and maps it to a snapshot using a JSON config file (which holds the snapshot-id to ds mapping); and something similar for Spark. This solution is very custom to the use case, and makes a lot of assumptions, but I guess the idea is to present this specific interface to users while using Iceberg - however, this could be a transitioning phase until user queries are fully migrated to using snapshots directly. I still think Iceberg would be a good candidate to avoid duplicating data and simplify users' requirement on knowing the partitioning and its implied meaning before querying the table. How are other folks with the same use case solving this with Iceberg? ---- On Iceberg snapshots: I know that in Iceberg we want to abstract partitioning as much as possible from the user, since this is really powerful. My initial thought is to use the natively supported table snapshots and time travel in Iceberg. However, it’s not straightforward for users to use a snapshot-id, and snapshots may not exactly correspond to the data at a given timestamp, only to the point on when the change was applied to the table, e.g. If I want the table data for 2021-01-01 00:00:00 UTC the commit that was created for that particular cut-over was done in 2021-01-01 06:00:00 UTC, so using timestamp is not straightforward either. Would it make sense to introduce a `snapshot-tag` concept that could be used to refer to a particular snapshot? I guess we could add it in the Snapshot summary, but there’s no way to use that tag instead of the ID to refer to the snapshot. This would allow us to tag specific snapshots and let users use the tags to query the snapshot, simplifying a bit the migration. Also, we’d need to make sure the tags are unique, same as the snapshot ids. In a way I think of this as something similar to Git, where snapshot-id is akin to commit hash and snapshot-tag is similar to a git tag. I think this would simplify the way to use snapshots on queries. I’m happy to hear other approaches. Thanks for reading and the comments in advance! Best, -- Edgar R