Hi Edgar,

I think we are facing the similar issue, and I raised a GitHub Issue #2231 is 
pretty like your `snapshot-tag` workaround.


Currently, we build ETL workflows based on Hive tables, and to achieve version 
control ability of data, we add a top level partition which named ts to all 
tables. And assign ts a specific value when trigger the workflow, then we can 
get the expected version of data of all tables with one ts.
I know I can use snapshot_id which is auto generated to fetch the specific 
version of data in iceberg table, and it means if I want to identify the 
snapshots of all tables involved in one workflow, every time run the workflow, 
I need to persist each table's snapshot_id when the table updated so I can use 
it later.
Is there an approach to assign the snapshot a snapshot_name besides 
snapshot_id? So we can track the snapshots of relevant tables such as generated 
in the same workflow in a convenient way.


Thanks,
Cheng Pan


On Mar 9, 2021 02:43,Edgar Rodriguez<edgar.rodrig...@airbnb.com.INVALID> wrote:


Hi folks,

I’d like to request some feedback on how to use Iceberg to approach a use case 
we have, that I believe some other folks could be facing, since this was a 
pattern usually followed with Hive tables.

Use case:
1. We used to have database table snapshots exported daily at 0 UTC. Each day a 
new partition is created with a materialized snapshot (e.g. ds=2021-02-01, 
ds=2021-02-02, ...)
2. We have a lot of queries written against this legacy structure.
3. We would like to start migrating to Iceberg by writing a table snapshot and 
periodically committing mutations (e.g. every half hour).
4. We are trying to retain the legacy interface (`ds` partition as a snapshot) 
to support the myriad of existing queries, which sometimes target multiple 
snapshots at the same time so that old queries continue to work, while new 
queries are written directly against Iceberg tables using time travel.

Issues:
An issue I see moving this use case to Iceberg is on the interface, as many 
users already have queries using the `ds` partitioning column to use the 
snapshot - also note that in this approach users NEED to know specifically that 
they can only query these tables with a `ds` filter, otherwise they could get 
duplicated rows. One thought we had to solve this was to use a thin wrapper, 
for instance in Hive a custom table InputFormat that takes the filter 
expression (with the `ds`) and maps it to a snapshot using a JSON config file 
(which holds the snapshot-id to ds mapping); and something similar for Spark. 
This solution is very custom to the use case, and makes a lot of assumptions, 
but I guess the idea is to present this specific interface to users while using 
Iceberg - however, this could be a transitioning phase until user queries are 
fully migrated to using snapshots directly.

I still think Iceberg would be a good candidate to avoid duplicating data and 
simplify users' requirement on knowing the partitioning and its implied meaning 
before querying the table.

How are other folks with the same use case solving this with Iceberg?


----

On Iceberg snapshots:
I know that in Iceberg we want to abstract partitioning as much as possible 
from the user, since this is really powerful. My initial thought is to use the 
natively supported table snapshots and time travel in Iceberg. However, it’s 
not straightforward for users to use a snapshot-id, and snapshots may not 
exactly correspond to the data at a given timestamp, only to the point on when 
the change was applied to the table, e.g. If I want the table data for 
2021-01-01 00:00:00 UTC the commit that was created for that particular 
cut-over was done in 2021-01-01 06:00:00 UTC, so using timestamp is not 
straightforward either. 
Would it make sense to introduce a `snapshot-tag` concept that could be used to 
refer to a particular snapshot? I guess we could add it in the Snapshot 
summary, but there’s no way to use that tag instead of the ID to refer to the 
snapshot. This would allow us to tag specific snapshots and let users use the 
tags to query the snapshot, simplifying a bit the migration. Also, we’d need to 
make sure the tags are unique, same as the snapshot ids. In a way I think of 
this as something similar to Git, where snapshot-id is akin to commit hash and 
snapshot-tag is similar to a git tag. I think this would simplify the way to 
use snapshots on queries.

I’m happy to hear other approaches. Thanks for reading and the comments in 
advance!

Best,

-- 

Edgar R

Reply via email to