Will a clock skew cause any issues w.r.t. relying on the snapshot commit time? I think we allow a mismatch up to a minute in TableMetadata.
We also planned to expose file sequence number (different from data sequence number). I believe you could lookup snapshot using that info. https://iceberg.apache.org/spec/#manifest-entry-fields <https://iceberg.apache.org/spec/#manifest-entry-fields> - Anton > On Apr 26, 2023, at 11:52 AM, Steven Wu <stevenz...@gmail.com> wrote: > > piggyback on this thread since we are discussing exposing more metadata in > ContentFile or FileScanTask. Flink source watermark alignment can potentially > leverage the snapshot timestamp (when data files are committed/appended to > the table). Is it reasonable to expose some snapshot metadata in the > FileScanTask? > > This can help the Flink job to ensure two (or more) sources are proceeding at > similar paces. Some use cases may require column stats (min-max values) for > the watermark alignment. Some use cases can leverage snapshot timestamps for > the alignment purpose. > > On Wed, Apr 26, 2023 at 11:15 AM Anton Okolnychyi > <aokolnyc...@apple.com.invalid> wrote: > My initial thinking is that exposing sequence numbers on ContentFile is > preferable (we would get it for free in scan tasks). That said, I’ll need to > see how complicated the implementation would be. Exposing it on > ContentScanTask is a viable alternative. However, we already have a precedent > for assigning specId in InheritableMetadata. > > - Anton > >> On Apr 26, 2023, at 10:41 AM, Ryan Blue <b...@tabular.io >> <mailto:b...@tabular.io>> wrote: >> >> Exposing sequence number makes sense for use cases like this. I also like >> the idea of exposing it through FileScanTask. That might be easier than >> trying to add it to ContentFile. >> >> Anton, what do you think about adding it to FileScanTask? >> >> On Wed, Apr 26, 2023 at 7:50 AM Anton Okolnychyi >> <aokolnyc...@apple.com.invalid <mailto:aokolnyc...@apple.com.invalid>> wrote: >> It is actually my bad not following up on that after #5913 and #6002. I’ll >> take a look at #5760 referenced below by the end of this week. >> >> The plan was to expose sequence numbers on ContentFile. It is needed in a >> number of use cases. >> >> - Anton >> >>> On Apr 26, 2023, at 4:55 AM, Gabor Kaszab <gaborkas...@apache.org >>> <mailto:gaborkas...@apache.org>> wrote: >>> >>> Hey Iceberg Community, >>> >>> I know there has been a discussion previously about exposing the sequence >>> number on a ContentFile level, but if I'm not mistaken that conversation >>> didn't end with a consensus. I found some relevant PRs that has been open >>> for a while: >>> https://github.com/apache/iceberg/pull/5760 >>> <https://github.com/apache/iceberg/pull/5760> >>> https://github.com/apache/iceberg/pull/4769 >>> <https://github.com/apache/iceberg/pull/4769> (merged into the above PR) >>> >>> The reason I bring this topic up is that we started investigating recently >>> how to add read support for equality deletes to Impala. Apparently, >>> implementation-wise we could save a lot of hassle if sequence numbers were >>> exposed on a file level through the API, preferably somewhere around >>> calling planFiles(). We could then have a virtual 'SEQUENCE_NUMBER' when >>> scanning the data and delete files (separate scanners) and could easily >>> filter the rows in the JOIN node that joins the rows from the data files >>> with the ones from the delete files. (wouldn't go into more depth atm) >>> >>> With this mail I'd like to revive this conversation with the hope of >>> eventually coming to a solution that satisfies all participants. I've been >>> thinking of implementation choices we have to somehow provide sequence >>> numbers for the files: >>> - Extending ContentFile with sequence number: I checked the above PRs and >>> IIUC the issue with this approach is that ContentFile is meant to be >>> immutable and by the time they are created we don't have sequence numbers >>> to populate the ContentFile object. >>> - Extend FileScanTask with the file-level sequence numbers so after calling >>> planFiles() we could retrieve these numbers via a new API call on the >>> FileScanTask. >>> >>> There might be many other ways to implement this and I'd love to hear what >>> people think and would be great to find a way that would help us out on >>> Impala. >>> >>> Cheers, >>> Gabor >>> >>> >> >> >> >> -- >> Ryan Blue >> Tabular >