Sequence number for ContentFiles

Gabor Kaszab Wed, 26 Apr 2023 04:55:43 -0700

Hey Iceberg Community,

I know there has been a discussion previously about exposing the sequence
number on a ContentFile level, but if I'm not mistaken that conversation
didn't end with a consensus. I found some relevant PRs that have been open
for a while:
https://github.com/apache/iceberg/pull/5760
https://github.com/apache/iceberg/pull/4769 (merged into the above PR)


The reason I bring this topic up is that we recently started investigating
how to add read support for equality deletes to Impala. Apparently,
implementation-wise we could save a lot of hassle if sequence numbers were
exposed on a file level through the API, preferably somewhere around
calling planFiles(). We could then have a virtual 'SEQUENCE_NUMBER' column
when scanning the data and delete files (separate scanners) and could
easily filter the rows in the JOIN node that joins the rows from the data
files with the ones from the delete files. (I won't go into more depth at
the moment.)
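
To make the JOIN-side filter a bit more concrete, here is a minimal Java
sketch. It assumes the rule from the Iceberg spec that an equality delete
applies to a data row only when the data file's sequence number is strictly
lower than the delete file's. All class, field, and method names below are
hypothetical and are not part of Iceberg or Impala; the single `id` column
stands in for whatever equality columns the delete file declares.

```java
import java.util.ArrayList;
import java.util.List;

final class SequenceNumberJoinSketch {
    /** Hypothetical data row: one key column plus the virtual sequence number. */
    static final class DataRow {
        final long id;
        final long sequenceNumber;

        DataRow(long id, long sequenceNumber) {
            this.id = id;
            this.sequenceNumber = sequenceNumber;
        }
    }

    /** Hypothetical equality-delete row with the same virtual column. */
    static final class DeleteRow {
        final long id;
        final long sequenceNumber;

        DeleteRow(long id, long sequenceNumber) {
            this.id = id;
            this.sequenceNumber = sequenceNumber;
        }
    }

    /** Keys must match and the data must be strictly older than the delete. */
    static boolean isDeletedBy(DataRow row, DeleteRow del) {
        return row.id == del.id && row.sequenceNumber < del.sequenceNumber;
    }

    /** Naive nested-loop stand-in for the JOIN node; returns surviving rows. */
    static List<DataRow> liveRows(List<DataRow> data, List<DeleteRow> deletes) {
        List<DataRow> live = new ArrayList<>();
        for (DataRow row : data) {
            boolean deleted = false;
            for (DeleteRow del : deletes) {
                if (isDeletedBy(row, del)) {
                    deleted = true;
                    break;
                }
            }
            if (!deleted) {
                live.add(row);
            }
        }
        return live;
    }
}
```

A real engine would of course use a hash join rather than nested loops; the
point is only that the comparison needs the sequence number of both sides.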

With this mail I'd like to revive this conversation with the hope of
eventually coming to a solution that satisfies all participants. I've been
thinking of implementation choices we have to somehow provide sequence
numbers for the files:
- Extend ContentFile with a sequence number: I checked the above PRs and
IIUC the issue with this approach is that ContentFile objects are meant to
be immutable, and by the time they are created we don't have sequence
numbers to populate them.
- Extend FileScanTask with the file-level sequence numbers, so that after
calling planFiles() we could retrieve these numbers via a new API call on
the FileScanTask.
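
The second option could look roughly like the sketch below. The types are
stand-ins written for this mail, not the real Iceberg interfaces, and
`fileSequenceNumber()` is a hypothetical method name. The sketch assumes the
planner already knows each file's sequence number when it builds the tasks,
so the file object itself can stay immutable.

```java
import java.util.ArrayList;
import java.util.List;

final class ScanTaskSequenceSketch {
    /** Stand-in for an immutable content file; it carries no sequence number. */
    static final class FileStub {
        private final String path;

        FileStub(String path) {
            this.path = path;
        }

        String path() {
            return path;
        }
    }

    /** Stand-in for a scan task that pairs a file with its sequence number. */
    static final class TaskStub {
        private final FileStub file;
        private final long fileSequenceNumber;

        TaskStub(FileStub file, long fileSequenceNumber) {
            this.file = file;
            this.fileSequenceNumber = fileSequenceNumber;
        }

        FileStub file() {
            return file;
        }

        /** Hypothetical accessor an engine would call after planning. */
        long fileSequenceNumber() {
            return fileSequenceNumber;
        }
    }

    /** Hypothetical planning step: sequenceNumbers.get(i) belongs to files.get(i). */
    static List<TaskStub> planFiles(List<FileStub> files, List<Long> sequenceNumbers) {
        if (files.size() != sequenceNumbers.size()) {
            throw new IllegalArgumentException("one sequence number per file expected");
        }
        List<TaskStub> tasks = new ArrayList<>();
        for (int i = 0; i < files.size(); i++) {
            tasks.add(new TaskStub(files.get(i), sequenceNumbers.get(i)));
        }
        return tasks;
    }
}
```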

There might be many other ways to implement this and I'd love to hear what
people think. It would be great to find a way that helps us out on
Impala.

Cheers,
Gabor
