I created #7449 and added it to our 1.3.0 milestone to make sure it gets delivered.
- Anton > On Apr 26, 2023, at 1:59 PM, Steven Wu <stevenz...@gmail.com> wrote: > > > Will a clock skew cause any issues w.r.t. relying on the snapshot commit > > time? I think we allow a mismatch up to a minute in TableMetadata. > > This is probably not a problem. typically the max allowed misalignment is > much longer than 1 minute. > > > We also planned to expose file sequence number (different from data > > sequence number). I believe you could lookup snapshot using that info. > > That would work too. > > > On Wed, Apr 26, 2023 at 1:10 PM Jack Ye <yezhao...@gmail.com > <mailto:yezhao...@gmail.com>> wrote: > +1 for using file sequence number. This work has been discussed for a long > time but never got picked up, would be great if someone can drive it to > completion. > > -Jack > > On Wed, Apr 26, 2023 at 12:03 PM Anton Okolnychyi > <aokolnyc...@apple.com.invalid> wrote: > Will a clock skew cause any issues w.r.t. relying on the snapshot commit > time? I think we allow a mismatch up to a minute in TableMetadata. > > We also planned to expose file sequence number (different from data sequence > number). I believe you could lookup snapshot using that info. > > https://iceberg.apache.org/spec/#manifest-entry-fields > <https://iceberg.apache.org/spec/#manifest-entry-fields> > > - Anton > >> On Apr 26, 2023, at 11:52 AM, Steven Wu <stevenz...@gmail.com >> <mailto:stevenz...@gmail.com>> wrote: >> >> piggyback on this thread since we are discussing exposing more metadata in >> ContentFile or FileScanTask. Flink source watermark alignment can >> potentially leverage the snapshot timestamp (when data files are >> committed/appended to the table). Is it reasonable to expose some snapshot >> metadata in the FileScanTask? >> >> This can help the Flink job to ensure two (or more) sources are proceeding >> at similar paces. Some use cases may require column stats (min-max values) >> for the watermark alignment. Some use cases can leverage snapshot timestamps >> for the alignment purpose. >> >> On Wed, Apr 26, 2023 at 11:15 AM Anton Okolnychyi >> <aokolnyc...@apple.com.invalid <mailto:aokolnyc...@apple.com.invalid>> wrote: >> My initial thinking is that exposing sequence numbers on ContentFile is >> preferable (we would get it for free in scan tasks). That said, I’ll need to >> see how complicated the implementation would be. Exposing it on >> ContentScanTask is a viable alternative. However, we already have a >> precedent for assigning specId in InheritableMetadata. >> >> - Anton >> >>> On Apr 26, 2023, at 10:41 AM, Ryan Blue <b...@tabular.io >>> <mailto:b...@tabular.io>> wrote: >>> >>> Exposing sequence number makes sense for use cases like this. I also like >>> the idea of exposing it through FileScanTask. That might be easier than >>> trying to add it to ContentFile. >>> >>> Anton, what do you think about adding it to FileScanTask? >>> >>> On Wed, Apr 26, 2023 at 7:50 AM Anton Okolnychyi >>> <aokolnyc...@apple.com.invalid <mailto:aokolnyc...@apple.com.invalid>> >>> wrote: >>> It is actually my bad not following up on that after #5913 and #6002. I’ll >>> take a look at #5760 referenced below by the end of this week. >>> >>> The plan was to expose sequence numbers on ContentFile. It is needed in a >>> number of use cases. >>> >>> - Anton >>> >>>> On Apr 26, 2023, at 4:55 AM, Gabor Kaszab <gaborkas...@apache.org >>>> <mailto:gaborkas...@apache.org>> wrote: >>>> >>>> Hey Iceberg Community, >>>> >>>> I know there has been a discussion previously about exposing the sequence >>>> number on a ContentFile level, but if I'm not mistaken that conversation >>>> didn't end with a consensus. I found some relevant PRs that has been open >>>> for a while: >>>> https://github.com/apache/iceberg/pull/5760 >>>> <https://github.com/apache/iceberg/pull/5760> >>>> https://github.com/apache/iceberg/pull/4769 >>>> <https://github.com/apache/iceberg/pull/4769> (merged into the above PR) >>>> >>>> The reason I bring this topic up is that we started investigating recently >>>> how to add read support for equality deletes to Impala. Apparently, >>>> implementation-wise we could save a lot of hassle if sequence numbers were >>>> exposed on a file level through the API, preferably somewhere around >>>> calling planFiles(). We could then have a virtual 'SEQUENCE_NUMBER' when >>>> scanning the data and delete files (separate scanners) and could easily >>>> filter the rows in the JOIN node that joins the rows from the data files >>>> with the ones from the delete files. (wouldn't go into more depth atm) >>>> >>>> With this mail I'd like to revive this conversation with the hope of >>>> eventually coming to a solution that satisfies all participants. I've been >>>> thinking of implementation choices we have to somehow provide sequence >>>> numbers for the files: >>>> - Extending ContentFile with sequence number: I checked the above PRs and >>>> IIUC the issue with this approach is that ContentFile is meant to be >>>> immutable and by the time they are created we don't have sequence numbers >>>> to populate the ContentFile object. >>>> - Extend FileScanTask with the file-level sequence numbers so after >>>> calling planFiles() we could retrieve these numbers via a new API call on >>>> the FileScanTask. >>>> >>>> There might be many other ways to implement this and I'd love to hear what >>>> people think and would be great to find a way that would help us out on >>>> Impala. >>>> >>>> Cheers, >>>> Gabor >>>> >>>> >>> >>> >>> >>> -- >>> Ryan Blue >>> Tabular >> >