Re: Ways To Alleviate Load For Tables With Many Snapshots

Gautam Tue, 26 Jan 2021 11:47:19 -0800

Hey Ryan & David,
I believe your change in [1] indirectly achieves this. David's issue is
that every table.load() instantiates one FS handle for each snapshot. Your
change converts the File reference into a location string, so the read is
already lazy (in a way?). The version David has been testing with predates
this change. I believe that with the change in [1] the FS handles issue
should be resolved.
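
To make the idea concrete, here is a minimal sketch of lazy snapshot
creation in plain Java. It is not Iceberg code: SketchSnapshot, Lazy, and
SketchTableMetadata are hypothetical names invented for illustration, and
the counter exists only to show how many snapshot objects get built. It
assumes that parsing keeps just a location string per snapshot and defers
everything else until a snapshot is actually requested.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

// Hypothetical stand-in for a snapshot; NOT Iceberg's Snapshot interface.
final class SketchSnapshot {
  // Counts how many snapshot objects were actually constructed.
  static final AtomicInteger CREATED = new AtomicInteger();

  final long id;
  final String manifestListLocation;

  SketchSnapshot(long id, String manifestListLocation) {
    CREATED.incrementAndGet();
    this.id = id;
    this.manifestListLocation = manifestListLocation;
  }
}

// Memoizing supplier: runs the delegate at most once, then caches the result.
final class Lazy<T> implements Supplier<T> {
  private Supplier<T> delegate;
  private T value;
  private boolean resolved;

  Lazy(Supplier<T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public synchronized T get() {
    if (!resolved) {
      value = delegate.get();
      resolved = true;
      delegate = null; // release the creator once it has run
    }
    return value;
  }
}

// Hypothetical metadata holder: stores one creator per snapshot id.
final class SketchTableMetadata {
  private final Map<Long, Lazy<SketchSnapshot>> snapshotsById = new LinkedHashMap<>();
  private final long currentSnapshotId;

  SketchTableMetadata(Map<Long, String> locationsById, long currentSnapshotId) {
    for (Map.Entry<Long, String> entry : locationsById.entrySet()) {
      long id = entry.getKey();
      String location = entry.getValue();
      // Only a cheap closure is stored here; no snapshot is built yet.
      snapshotsById.put(id, new Lazy<>(() -> new SketchSnapshot(id, location)));
    }
    this.currentSnapshotId = currentSnapshotId;
  }

  // Builds only the requested snapshot, or returns null for an unknown id.
  SketchSnapshot snapshot(long id) {
    Lazy<SketchSnapshot> lazy = snapshotsById.get(id);
    return lazy == null ? null : lazy.get();
  }

  SketchSnapshot currentSnapshot() {
    return snapshot(currentSnapshotId);
  }

  // Forces every snapshot; callers that need the full list still pay for it.
  List<SketchSnapshot> snapshots() {
    List<SketchSnapshot> all = new ArrayList<>();
    for (Lazy<SketchSnapshot> lazy : snapshotsById.values()) {
      all.add(lazy.get());
    }
    return all;
  }
}
```

With this shape, a reader that only calls currentSnapshot() builds one
object no matter how many snapshots the table holds. The full cost is
paid only if something asks for the whole list.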


Please correct me if I'm wrong, David/Ryan.

thanks and regards,
-Gautam.

[1] - https://github.com/apache/iceberg/pull/1085/files

On Tue, Jan 26, 2021 at 10:55 AM Ryan Blue <rb...@netflix.com.invalid>
wrote:

> David,
>
> We could probably make it so that Snapshot instances are lazily created
> from the metadata file, but that would be a fairly large change. If you're
> interested, we can definitely make it happen.
>
> I agree with Vivekanand, though. A much easier solution is to reduce the
> number of snapshots in the table by expiring them. How long are you
> retaining snapshots?
>
> rb
>
> On Thu, Jan 21, 2021 at 8:11 PM Vivekanand Vellanki <vi...@dremio.com>
> wrote:
>
>> Just curious, what is the need to retain all those snapshots?
>>
>> I would assume that there is a mechanism to expire snapshots and delete
>> data/manifest files that are no longer required.
>>
>> On Thu, Jan 21, 2021 at 11:01 PM David Wilcox <dawil...@adobe.com.invalid>
>> wrote:
>>
>>> Hi Iceberg Devs,
>>>
>>> I have a process that reads Tables stored in Iceberg and processes them,
>>> many at a time. Lately, we've had problems with the scalability of our
>>> process due to the number of Hadoop Filesystem objects created inside
>>> Iceberg for Tables with many snapshots. These tables could have tens of
>>> thousands of snapshots inside, but I only want to read the latest snapshot.
>>> Inside the Hadoop Filesystem creation code that's called for every
>>> snapshot, there are process-level locks that end up locking up my whole
>>> process.
>>>
>>> Inside TableMetadataParser, it looks like we read in every snapshot even
>>> though the reader likely only wants one snapshot. This loop is what's
>>> responsible for locking up my process.
>>>
>>> https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320
>>>
>>> I noticed that my process does not care about the whole snapshot list.
>>> My process is only interested in a particular snapshot -- just one of them.
>>> I'm interested in making a contribution so that the entire snapshot list is
>>> lazily calculated inside of TableMetadata where it's actually used. So, we
>>> would not create the Snapshot itself in TableMetadataParser, but instead
>>> likely would pass a SnapshotCreator in that could know how to create
>>> snapshots. We would pass all of the SnapshotCreators into TableMetadata
>>> which would create snapshots when needed.
>>>
>>> Would you be amenable to such a change? I want to make sure that you
>>> think that this sounds like something you would accept before I spend time
>>> coding it up.
>>>
>>> Any other thoughts on this?
>>>
>>> Thanks,
>>> David Wilcox
>>>
>>
>
> --
> Ryan Blue
> Software Engineer
> Netflix
>
