Hi Dabby,

I think your assessment is right.

- Table metadata isn’t versioned with snapshots and is a good idea for table-level configuration. It sounds like what you need is additional information about a snapshot, so table properties don’t make sense.
- Using upper and lower bounds is ideal, but if you don’t have bounds for the timestamp you’re trying to track, then it isn’t an option.
- That leaves storing extra metadata in the Snapshot summary. That’s what this feature is for. We use these properties to track how Flink checkpoints map to Iceberg snapshots for exactly-once commits.

If you’re going to store extra information in the Snapshot summary, I recommend keeping it small. Streaming use cases tend to have lots of snapshots, so the data can add up quickly if you’re adding lots of information to the summary.

rb

On Tue, Jan 28, 2020 at 7:28 AM Dabeluchi Ndubisi <dabeluchi.ndub...@shopify.com.invalid> wrote:

> Hi,
>
> We would like to store snapshot metadata that is necessary for
> producing/consuming incremental data. An example of this is the maximum
> value of an event timeline that we have processed so far, so that we know
> where to read from next.
>
> Some of the possible options that we have discovered so far are:
>
> 1) to store such metadata in the TableMetadata properties, but this is
> already advised against in the Iceberg specification.
>
> 2) to use the max of the upper bounds of an event timestamp column tracked
> by the Datafiles in a snapshot, but this wouldn’t be accurate as we can
> have cases where the max value of an event timestamp column is less than
> the event time for which data spans (especially for sparse datasets).
>
> 3) to store such metadata in the summary property of the snapshot. This
> seems to be the most promising approach, but we wanted to know if there are
> any restrictions on the maximum length of information that can be stored in
> the summary property of a Snapshot. A downside to this approach is that the
> summary property of the snapshot only holds Strings, so we will have to
> always convert all data to Strings in order to use this.
>
> If none of the above is the most suitable place to store such information,
> please could anyone advise any other approaches they have taken to solve
> this?
>
> Thanks,
> Dabby

--
Ryan Blue
Software Engineer
Netflix
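
As a rough illustration of the summary-based approach discussed in this thread, here is a minimal sketch of keeping a single small, string-encoded watermark per snapshot and reading back the most recent one. It does not call Iceberg: a plain dict stands in for a snapshot summary (a string-to-string map), and the key name `max-event-timestamp-ms` and the helper functions are hypothetical names chosen for this example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional

# Hypothetical property key; it is not defined by Iceberg.
WATERMARK_KEY = "max-event-timestamp-ms"
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_watermark(ts: datetime) -> Dict[str, str]:
    """Encode a timezone-aware timestamp as one short string property."""
    # Integer division by a timedelta avoids float rounding.
    millis = (ts - EPOCH) // timedelta(milliseconds=1)
    return {WATERMARK_KEY: str(millis)}


def decode_watermark(summary: Dict[str, str]) -> Optional[datetime]:
    """Return the watermark stored in a summary, or None if it is absent."""
    raw = summary.get(WATERMARK_KEY)
    if raw is None:
        return None
    return EPOCH + timedelta(milliseconds=int(raw))


def latest_watermark(summaries: List[Dict[str, str]]) -> Optional[datetime]:
    """Scan summaries ordered newest first and return the first watermark found.

    Snapshots without the property (for example, ones written by another
    job) are skipped.
    """
    for summary in summaries:
        watermark = decode_watermark(summary)
        if watermark is not None:
            return watermark
    return None
```

Storing one epoch-milliseconds string keeps each summary to a few bytes, which matters when a streaming job produces many snapshots. A consumer would call `latest_watermark` on the snapshot history to decide where to read from next.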