Re: Sequence number for ContentFiles

Anton Okolnychyi Wed, 26 Apr 2023 11:15:15 -0700

My initial thinking is that exposing sequence numbers on ContentFile is 
preferable (we would get it for free in scan tasks). That said, I’ll need to 
see how complicated the implementation would be. Exposing it on ContentScanTask 
is a viable alternative. However, we already have a precedent for assigning 
specId in InheritableMetadata.


- Anton

> On Apr 26, 2023, at 10:41 AM, Ryan Blue wrote:
> 
> Exposing sequence number makes sense for use cases like this. I also like the 
> idea of exposing it through FileScanTask. That might be easier than trying to 
> add it to ContentFile.
> 
> Anton, what do you think about adding it to FileScanTask?
> 
> On Wed, Apr 26, 2023 at 7:50 AM Anton Okolnychyi wrote:
> It is actually my bad for not following up on that after #5913 and #6002. I’ll 
> take a look at #5760 referenced below by the end of this week. 
> 
> The plan was to expose sequence numbers on ContentFile. It is needed in a 
> number of use cases.
> 
> - Anton
> 
>> On Apr 26, 2023, at 4:55 AM, Gabor Kaszab wrote:
>> 
>> Hey Iceberg Community,
>> 
>> I know there has been a discussion previously about exposing the sequence 
>> number on a ContentFile level, but if I'm not mistaken that conversation 
>> didn't end with a consensus. I found some relevant PRs that have been open 
>> for a while:
>> https://github.com/apache/iceberg/pull/5760
>> https://github.com/apache/iceberg/pull/4769 (merged into the above PR)
>> 
>> The reason I bring this topic up is that we started investigating recently 
>> how to add read support for equality deletes to Impala. Apparently, 
>> implementation-wise we could save a lot of hassle if sequence numbers were 
>> exposed on a file level through the API, preferably somewhere around calling 
>> planFiles(). We could then have a virtual 'SEQUENCE_NUMBER' column when scanning 
>> the data and delete files (separate scanners) and could easily filter the 
>> rows in the JOIN node that joins the rows from the data files with the ones 
>> from the delete files. (wouldn't go into more depth atm)
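
To make the JOIN-side filter described above concrete, here is a minimal sketch in Python. It assumes the rule from the Iceberg table spec that an equality delete applies to a data row only when the data file's sequence number is strictly lower than the delete file's. The names DataRow, DeleteRow and apply_equality_deletes are hypothetical, not Impala or Iceberg APIs, and partition scoping is ignored for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataRow:
    key: int      # value of the equality field (hypothetical single-column key)
    payload: str
    seq: int      # virtual SEQUENCE_NUMBER taken from the data file


@dataclass(frozen=True)
class DeleteRow:
    key: int      # equality-field value to delete
    seq: int      # virtual SEQUENCE_NUMBER taken from the delete file


def apply_equality_deletes(data_rows, delete_rows):
    """Return the data rows that survive the equality deletes."""
    # For each key, only the newest delete matters: if any delete is newer
    # than the data row, the one with the highest sequence number is too.
    newest_delete = {}
    for d in delete_rows:
        if d.key not in newest_delete or d.seq > newest_delete[d.key]:
            newest_delete[d.key] = d.seq

    def survives(row):
        delete_seq = newest_delete.get(row.key)
        # A delete removes the row only if the row is strictly older.
        return delete_seq is None or row.seq >= delete_seq

    return [row for row in data_rows if survives(row)]


rows = [DataRow(1, "old", 1), DataRow(1, "rewritten", 3), DataRow(2, "other", 1)]
deletes = [DeleteRow(1, 2)]
remaining = apply_equality_deletes(rows, deletes)
```

In this example the row written at sequence number 1 with key 1 is dropped, while the row with the same key written at sequence number 3 is kept because it was added after the delete.
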
>> 
>> With this mail I'd like to revive this conversation with the hope of 
>> eventually coming to a solution that satisfies all participants. I've been 
>> thinking of implementation choices we have to somehow provide sequence 
>> numbers for the files:
>> - Extending ContentFile with sequence number: I checked the above PRs and 
>> IIUC the issue with this approach is that ContentFile is meant to be 
>> immutable, and at the time it is created we don't yet have a sequence number 
>> to populate the ContentFile object with.
>> - Extend FileScanTask with the file-level sequence numbers so after calling 
>> planFiles() we could retrieve these numbers via a new API call on the 
>> FileScanTask.
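
A rough sketch of the second option, in Python rather than Java so that it does not suggest any real Iceberg signatures. It assumes, as I understand the table spec, that a manifest entry with no explicit sequence number inherits the sequence number of its manifest. The classes below are hypothetical stand-ins that only borrow the Iceberg names; plan_files is not the real planFiles().

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContentFile:
    # Stays immutable and knows nothing about sequence numbers.
    path: str


@dataclass(frozen=True)
class ManifestEntry:
    file: ContentFile
    sequence_number: Optional[int]  # None means "inherit from the manifest"


@dataclass(frozen=True)
class FileScanTask:
    file: ContentFile
    sequence_number: int  # resolved at planning time, exposed to the engine


def plan_files(manifest_sequence_number, entries):
    """Build scan tasks, resolving inherited sequence numbers per entry."""
    tasks = []
    for entry in entries:
        seq = entry.sequence_number
        if seq is None:
            seq = manifest_sequence_number
        tasks.append(FileScanTask(entry.file, seq))
    return tasks


entries = [
    ManifestEntry(ContentFile("a.parquet"), None),  # newly added, inherits
    ManifestEntry(ContentFile("b.parquet"), 4),     # carried over, explicit
]
tasks = plan_files(7, entries)
```

The point of the sketch is that the file object never changes after creation; the sequence number is attached to the task when the manifest is read.
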
>> 
>> There might be many other ways to implement this and I'd love to hear what 
>> people think. It would be great to find a way that would help us out on 
>> Impala.
>> 
>> Cheers,
>> Gabor
>> 
>> 
> 
> 
> 
> -- 
> Ryan Blue
> Tabular
