Hi Ryan,

On Tue, Mar 9, 2021 at 5:54 AM Ryan Murray <rym...@gmail.com> wrote:

> Hey Edgar, Cheng Pan,
>
> I am not sure if you are aware of project nessie
> <https://projectnessie.org>? It _may_ suit your needs. Nessie applies
> git-like functionality to iceberg tables (in this case most useful are
> branches and tags).

Thanks for the suggestion. I did look at nessie and it looks really cool.

> In effect you would be pivoting the snapshot partition into the table
> itself and using nessie tags to represent the previous table snapshots. You
> could create a tag for each database snapshot with the date the snapshot
> was taken and the `main` branch would then receive your half hour updates. I
> think the major issue is that you would lose the `ds` partition column and
> have to use the `select * from tablename@tagname` syntax that nessie
> supports to query a specific `ds`, however it would provide you with the
> `snapshot-tag` concept you suggested above. A potential extra benefit is
> that all tables would be under the same tag so you would in effect have the
> same tag for the set of tables rather than an iceberg snapshot id per table.

Yeah, I think the main issue with this workflow is either keeping the `ds`
way of querying the tables (while in Iceberg we try to hide partitioning from
the user) or making the migration not too disruptive for the user. For
instance, if Iceberg supported the snapshot-tag, we could use the Spark
procedure to set the current snapshot using a tag, which may be easier to use
than the snapshot-id it expects right now.

In general, I think snapshot-ids in Iceberg are a bit hard to track,
especially when the goal is to make them easy to refer to. Not all
snapshot-ids may be relevant to users, and an external mapping would be
needed to track the ones taken at specific points.

So while they would not have the full features of nessie, I think
snapshot-tags by themselves would still be useful for folks using their own
catalog/tools or vanilla Iceberg.

Thanks,
--
Edgar R