Hey Iceberg Community,

I know there has been a previous discussion about exposing the sequence number at the ContentFile level, but if I'm not mistaken that conversation didn't end with a consensus. I found some relevant PRs that have been open for a while:
https://github.com/apache/iceberg/pull/5760
https://github.com/apache/iceberg/pull/4769 (merged into the above PR)

The reason I bring this topic up is that we recently started investigating how to add read support for equality deletes to Impala. Apparently, implementation-wise we could save a lot of hassle if sequence numbers were exposed at the file level through the API, preferably somewhere around the call to planFiles(). We could then have a virtual 'SEQUENCE_NUMBER' column when scanning the data files and the delete files (with separate scanners), and could easily filter the rows in the JOIN node that joins the rows from the data files with the ones from the delete files. (I won't go into more depth for now.)

With this mail I'd like to revive the conversation in the hope of eventually reaching a solution that satisfies all participants.

I've been thinking about the implementation choices we have for providing sequence numbers for the files:
- Extend ContentFile with the sequence number: I checked the above PRs and, IIUC, the issue with this approach is that ContentFile objects are meant to be immutable, and by the time they are created we don't have the sequence numbers to populate them with.
- Extend FileScanTask with the file-level sequence numbers, so that after calling planFiles() we could retrieve these numbers via a new API call on the FileScanTask.

There might be many other ways to implement this and I'd love to hear what people think. It would be great to find an approach that helps us out on the Impala side.

Cheers,
Gabor
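
P.S. To make the JOIN-side filtering a bit more concrete, here is a rough sketch in plain Java. It is only an illustration and not how Impala or Iceberg implement anything: the class, the Row type and the method names are all made up for this mail, and it deliberately ignores partition scoping and multi-column equality keys. The one rule it relies on is from the Iceberg spec: an equality delete applies to a data file only if the data file's data sequence number is strictly lower than the delete file's.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: none of these names exist in Iceberg or Impala.
final class SequenceNumberJoinSketch {

    // A scanned row: its equality key plus the virtual SEQUENCE_NUMBER value
    // inherited from the data file or delete file it was read from.
    static final class Row {
        final String key;
        final long sequenceNumber;

        Row(String key, long sequenceNumber) {
            this.key = key;
            this.sequenceNumber = sequenceNumber;
        }
    }

    // Equality deletes only affect data written before the delete, i.e. data
    // whose sequence number is strictly lower than the delete's.
    static boolean equalityDeleteApplies(long dataSeq, long deleteSeq) {
        return dataSeq < deleteSeq;
    }

    // Anti-join of data rows against delete rows, filtered by sequence number.
    static List<Row> applyEqualityDeletes(List<Row> dataRows, List<Row> deleteRows) {
        // Build side: for each key keep only the highest delete sequence
        // number, since that one covers every older data row with that key.
        Map<String, Long> maxDeleteSeq = new HashMap<>();
        for (Row delete : deleteRows) {
            maxDeleteSeq.merge(delete.key, delete.sequenceNumber, Long::max);
        }

        // Probe side: keep a data row unless a matching delete applies to it.
        List<Row> surviving = new ArrayList<>();
        for (Row row : dataRows) {
            Long deleteSeq = maxDeleteSeq.get(row.key);
            if (deleteSeq == null || !equalityDeleteApplies(row.sequenceNumber, deleteSeq)) {
                surviving.add(row);
            }
        }
        return surviving;
    }
}
```

For example, with data rows ("a", seq 1), ("a", seq 3), ("b", seq 1) and a single delete row ("a", seq 2), only ("a", seq 1) is dropped: the row written at sequence number 3 is newer than the delete and therefore survives. This is why having the file-level sequence number available on both scanners would be enough for the JOIN node to do the filtering.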