Great, thanks for the update! I'm glad that cleaning that up fixed the problem.
On Tue, Jan 26, 2021 at 11:46 AM Gautam <gautamkows...@gmail.com> wrote:

> Hey Ryan & David,
>
> I believe this change from you [1] indirectly achieves this. David's issue
> is that every table.load() instantiates one FS handle per snapshot, and by
> converting the File reference into a location string, your change already
> makes this a lazy read (in a way). The version David has been testing with
> predates this change. I believe that with [1], the FS-handles issue should
> be resolved.
>
> Please correct me if I'm wrong, David / Ryan.
>
> thanks and regards,
> -Gautam.
>
> [1] - https://github.com/apache/iceberg/pull/1085/files
>
> On Tue, Jan 26, 2021 at 10:55 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> David,
>>
>> We could probably make it so that Snapshot instances are lazily created
>> from the metadata file, but that would be a fairly large change. If you're
>> interested, we can definitely make it happen.
>>
>> I agree with Vivekanand, though. A much easier solution is to reduce the
>> number of snapshots in the table by expiring them. How long are you
>> retaining snapshots?
>>
>> rb
>>
>> On Thu, Jan 21, 2021 at 8:11 PM Vivekanand Vellanki <vi...@dremio.com> wrote:
>>
>>> Just curious, what is the need to retain all those snapshots?
>>>
>>> I would assume that there is a mechanism to expire snapshots and delete
>>> data/manifest files that are no longer required.
>>>
>>> On Thu, Jan 21, 2021 at 11:01 PM David Wilcox <dawil...@adobe.com.invalid> wrote:
>>>
>>>> Hi Iceberg Devs,
>>>>
>>>> I have a process that reads Tables stored in Iceberg and processes
>>>> them, many at a time. Lately, we've had scalability problems in this
>>>> process because of the number of Hadoop FileSystem objects created
>>>> inside Iceberg for tables with many snapshots. These tables can have
>>>> tens of thousands of snapshots, but I only want to read the latest
>>>> one. The Hadoop FileSystem creation code, which is called for every
>>>> snapshot, takes process-level locks that end up locking up my whole
>>>> process.
>>>>
>>>> Inside TableMetadataParser, it looks like we read in every snapshot
>>>> even though the reader likely only wants one. This loop is what's
>>>> responsible for locking up my process:
>>>>
>>>> https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320
>>>>
>>>> My process does not care about the whole snapshot list; it is only
>>>> interested in one particular snapshot. I'm interested in contributing
>>>> a change so that the snapshot list is lazily materialized inside
>>>> TableMetadata, where it's actually used. Instead of creating each
>>>> Snapshot in TableMetadataParser, we would pass in something like a
>>>> SnapshotCreator that knows how to create snapshots, hand all of the
>>>> SnapshotCreators to TableMetadata, and have it create Snapshot objects
>>>> only when needed.
>>>>
>>>> Would you be amenable to such a change? I want to make sure this
>>>> sounds like something you would accept before I spend time coding it
>>>> up.
>>>>
>>>> Any other thoughts on this?
>>>>
>>>> Thanks,
>>>> David Wilcox
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>

--
Ryan Blue
Software Engineer
Netflix
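
[Editor's note] For readers landing on this thread: a minimal sketch of the snapshot-expiration approach Ryan suggests, using Iceberg's ExpireSnapshots API. The catalog setup, warehouse path, table name, and retention window below are assumptions for illustration, not details from the thread.

    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;

    public class ExpireOldSnapshots {
      public static void main(String[] args) {
        // Hypothetical warehouse location and table name.
        HadoopCatalog catalog =
            new HadoopCatalog(new Configuration(), "hdfs://warehouse/path");
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        // Expire snapshots older than one week, but always retain the
        // last 10 so recent history stays available for time travel.
        long weekAgoMillis =
            System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
        table.expireSnapshots()
            .expireOlderThan(weekAgoMillis)
            .retainLast(10)
            .commit();
      }
    }

Run periodically, this keeps the snapshot list in the metadata file small, so table.load() has far fewer snapshots to parse in the first place.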
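[Editor's note] And a rough sketch of the lazy-snapshot idea David describes: keep the raw JSON for each snapshot and defer parsing (and any FileSystem access) until a snapshot is first requested. This is an illustration only, not Iceberg's actual implementation; the Snapshot placeholder and helper names are made up.

    import java.util.List;
    import java.util.function.Supplier;
    import java.util.stream.Collectors;

    class LazySnapshots {
      // Placeholder standing in for Iceberg's parsed Snapshot object.
      interface Snapshot {}

      // Wraps a parser call so no parsing or FileSystem work happens
      // until get() is first called; the result is cached afterward.
      static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
          private T value;
          private boolean done;

          @Override
          public synchronized T get() {
            if (!done) {
              value = delegate.get();
              done = true;
            }
            return value;
          }
        };
      }

      // Instead of eagerly parsing every snapshot (as in the loop David
      // links in TableMetadataParser), hand back deferred suppliers.
      static List<Supplier<Snapshot>> fromJson(List<String> snapshotJsonNodes) {
        return snapshotJsonNodes.stream()
            .map(json -> memoize(() -> parseSnapshot(json)))
            .collect(Collectors.toList());
      }

      private static Snapshot parseSnapshot(String json) {
        // The real parser would build a Snapshot from this JSON node.
        return new Snapshot() {};
      }
    }

With this shape, a reader that wants only the current snapshot calls get() on a single supplier, and the tens of thousands of other snapshots are never parsed.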