Hey Edgar, Cheng Pan,

I am not sure if you are aware of Project Nessie <https://projectnessie.org>?
It _may_ suit your needs. Nessie applies Git-like functionality to Iceberg
tables (the most useful features here being branches and tags).

In effect, you would be pivoting the snapshot partition into the table
itself and using Nessie tags to represent the previous table snapshots. You
could create a tag for each database snapshot with the date the snapshot
was taken, and the `main` branch would then receive your half-hour updates.
I think the major issue is that you would lose the `ds` partition column
and have to use the `select * from tablename@tagname` syntax that Nessie
supports to query a specific `ds`; however, it would provide you with the
`snapshot-tag` concept you suggested above. A potential extra benefit is
that all tables would be under the same tag, so you would in effect have
one tag for the whole set of tables rather than an Iceberg snapshot id per
table.
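As a rough sketch of that tag-per-snapshot idea (assuming a hypothetical
`ds-<date>` tag naming convention; Nessie does not mandate any particular
scheme), a legacy `ds` filter could be rewritten into the `table@tag` syntax:

```python
# Sketch: map a legacy `ds` partition value to a Nessie tag and build the
# equivalent query string. The `ds-<date>` tag naming convention is a
# hypothetical choice for illustration, not something Nessie requires.

def tag_for_ds(ds: str) -> str:
    """Derive a Nessie tag name from a legacy ds partition value."""
    return f"ds-{ds}"

def rewrite_query(table: str, ds: str) -> str:
    """Rewrite a legacy ds-filtered scan into Nessie's table@tag syntax."""
    return f"SELECT * FROM {table}@{tag_for_ds(ds)}"

print(rewrite_query("events", "2021-02-01"))
# SELECT * FROM events@ds-2021-02-01
```

A thin view or query-rewrite layer like this could keep old `ds`-style
queries working while the tags carry the actual snapshot semantics.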

Happy to discuss more if this path sounds like something you may be
interested in.

Best,
Ryan

On Tue, Mar 9, 2021 at 2:59 AM 379377944 <379377...@qq.com> wrote:

> Hi Edgar,
>
>
> I think we are facing a similar issue; the GitHub issue #2231
> <https://github.com/apache/iceberg/issues/2231> I raised is quite similar
> to your `snapshot-tag` workaround.
>
>
> Currently, we build ETL workflows based on Hive tables, and to achieve
> version control of the data, we add a top-level partition named `ts` to
> all tables. We assign `ts` a specific value when triggering the workflow,
> so we can then get the expected version of data across all tables with a
> single `ts`.
>
> I know I can use the auto-generated snapshot_id to fetch a specific
> version of data in an Iceberg table, but that means that if I want to
> identify the snapshots of all tables involved in one workflow, then every
> time the workflow runs I need to persist each table's snapshot_id when
> the table is updated, so I can use it later.
>
> Is there an approach to assign a snapshot a snapshot_name besides the
> snapshot_id? That way we could track the snapshots of related tables,
> such as those generated in the same workflow, in a convenient way.
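The bookkeeping described above can be sketched as a small run manifest;
the layout below (run id mapped to per-table snapshot ids) is only an
illustrative assumption, not an Iceberg feature, and the snapshot ids are
made-up example values:

```python
# Sketch: persist each table's snapshot_id per workflow run so that the
# consistent set of snapshots produced together can be looked up later.
import json

def record_run(manifest: dict, run_id: str, table: str, snapshot_id: int) -> None:
    """Record the snapshot_id a workflow run produced for one table."""
    manifest.setdefault(run_id, {})[table] = snapshot_id

manifest: dict = {}
record_run(manifest, "2021-03-09T00:00", "dim_users", 5867483032979613000)
record_run(manifest, "2021-03-09T00:00", "fact_orders", 1269032337843013000)

# One run id now resolves to the whole set of snapshots for that run:
print(json.dumps(manifest["2021-03-09T00:00"]))
```

This is exactly the per-table persistence burden the proposal wants to
avoid; a shared `snapshot_name` (or tag) would make the name itself carry
this grouping.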
>
>
> Thanks,
> Cheng Pan
>
> On Mar 9, 2021 02:43, Edgar Rodriguez <edgar.rodrig...@airbnb.com.INVALID>
> wrote:
>
> Hi folks,
>
> I’d like to request some feedback on how to use Iceberg to approach a use
> case we have, which I believe some other folks could be facing as well,
> since this was a common pattern with Hive tables.
>
> Use case:
> 1. We used to have database table snapshots exported daily at 00:00 UTC.
> Each day a new partition is created with a materialized snapshot (e.g.
> ds=2021-02-01, ds=2021-02-02, ...).
> 2. We have a lot of queries written against this legacy structure.
> 3. We would like to start migrating to Iceberg by writing a table snapshot
> and periodically committing mutations (e.g. every half hour).
> 4. We would like to retain the legacy interface (the `ds` partition as a
> snapshot) so that the myriad of existing queries, which sometimes target
> multiple snapshots at the same time, continue to work, while new queries
> are written directly against Iceberg tables using time travel.
>
> Issues:
> An issue I see in moving this use case to Iceberg is the interface, as
> many users already have queries that use the `ds` partitioning column to
> select the snapshot; also note that in this approach users NEED to know
> that they can only query these tables with a `ds` filter, otherwise they
> could get duplicated rows. One thought we had to solve this was to use a
> thin wrapper; for instance, in Hive, a custom table InputFormat that
> takes the filter expression (with the `ds`) and maps it to a snapshot
> using a JSON config file (which holds the snapshot-id to ds mapping), and
> something similar for Spark. This solution is very custom to the use case
> and makes a lot of assumptions, but the idea is to present this specific
> interface to users while using Iceberg; it could serve as a transitional
> phase until user queries are fully migrated to using snapshots directly.
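The thin-wrapper idea above can be sketched roughly as follows; the JSON
config shape, the example snapshot ids, and the helper name
`snapshot_for_ds` are all assumptions for illustration. (In Spark, the
resolved id could then feed Iceberg's `snapshot-id` read option.)

```python
# Sketch of the thin-wrapper idea: a JSON config maps each legacy `ds`
# value to an Iceberg snapshot-id, and a helper resolves a ds filter to a
# time-travel read. Config layout and ids are illustrative assumptions.
import json

CONFIG = json.loads("""
{
  "2021-02-01": 3051729675574597000,
  "2021-02-02": 8924558786060583000
}
""")

def snapshot_for_ds(ds: str) -> int:
    """Resolve a legacy ds partition value to its mapped snapshot-id."""
    try:
        return CONFIG[ds]
    except KeyError:
        raise ValueError(f"no snapshot mapped for ds={ds}")

# e.g. in Spark the resolved id could be passed as a read option:
#   spark.read.option("snapshot-id", snapshot_for_ds("2021-02-01")).table("db.t")
print(snapshot_for_ds("2021-02-01"))
```

The wrapper's job is then only to intercept the `ds` predicate and swap it
for the time-travel read before planning the scan.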
>
> I still think Iceberg would be a good candidate to avoid duplicating data
> and to remove the requirement that users know the partitioning and its
> implied meaning before querying the table.
>
> How are other folks with the same use case solving this with Iceberg?
>
> ----
>
> On Iceberg snapshots:
> I know that in Iceberg we want to abstract partitioning away from the
> user as much as possible, since this is really powerful. My initial
> thought is to use the natively supported table snapshots and time travel
> in Iceberg. However, it’s not straightforward for users to work with a
> snapshot-id, and snapshots may not correspond exactly to the data at a
> given timestamp, only to the point when the change was applied to the
> table. E.g. if I want the table data for 2021-01-01 00:00:00 UTC but the
> commit for that particular cut-over was made at 2021-01-01 06:00:00 UTC,
> then using a timestamp is not straightforward either.
> Would it make sense to introduce a `snapshot-tag` concept that could be
> used to refer to a particular snapshot? I guess we could add it to the
> snapshot summary, but there’s currently no way to use such a tag instead
> of the ID to refer to the snapshot. This would allow us to tag specific
> snapshots and let users query a snapshot by its tag, simplifying the
> migration a bit. Also, we’d need to make sure the tags are unique, the
> same as the snapshot ids. In a way I think of this as something similar
> to Git, where the snapshot-id is akin to a commit hash and the
> snapshot-tag is similar to a Git tag. I think this would simplify the way
> snapshots are used in queries.
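The proposed `snapshot-tag` semantics can be sketched as a small registry:
a unique, human-readable name resolving to a snapshot-id, analogous to a
Git tag pointing at a commit hash. This is only an illustration of the
proposal, not an existing Iceberg API, and the id value is made up.

```python
# Sketch of the proposed `snapshot-tag` concept: unique names resolving to
# snapshot-ids, with duplicates rejected the way Git rejects existing tags.

class TagRegistry:
    def __init__(self) -> None:
        self._tags: dict = {}

    def tag(self, name: str, snapshot_id: int) -> None:
        """Attach a unique tag to a snapshot; reject duplicate names."""
        if name in self._tags:
            raise ValueError(f"tag {name!r} already exists")
        self._tags[name] = snapshot_id

    def resolve(self, name: str) -> int:
        """Look up the snapshot-id a tag refers to."""
        return self._tags[name]

tags = TagRegistry()
tags.tag("ds-2021-01-01", 3051729675574597000)
print(tags.resolve("ds-2021-01-01"))
```

With uniqueness enforced at tag-creation time, queries could safely use the
tag anywhere a snapshot-id is accepted today.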
>
> I’m happy to hear other approaches. Thanks for reading and the comments in
> advance!
>
> Best,
> --
> Edgar R
>
