Re: Ways To Alleviate Load For Tables With Many Snapshots

David Wilcox Tue, 26 Jan 2021 12:09:55 -0800

Whoops -- I had the link a bit wrong for the github issue. Sorry.
https://github.com/apache/iceberg/issues/2130

________________________________
From: David Wilcox <[email protected]>
Sent: Tuesday, January 26, 2021 12:59 PM
To: Gautam <[email protected]>; Iceberg Dev List 
<[email protected]>; Ryan Blue <[email protected]>
Cc: Gautam Kowshik <[email protected]>; Xabriel Collazo Mojica 
<[email protected]>; Grp-XAD <[email protected]>
Subject: Re: Ways To Alleviate Load For Tables With Many Snapshots

Ahh. This is better. I hadn't gotten any emails from anyone on this list 
earlier! This is refreshing!

Yes. I made a change myself, but then noticed, and confirmed with people 
inside Adobe (Gautam included), that your change, Ryan, fixed my problem. 
Thanks for that! 😄

I also filed the issue here:
https://github.com/apache/iceberg/issues?q=is%3Aissue

As far as I'm concerned, we can consider this issue solved.

Thanks!
________________________________
From: Gautam <[email protected]>
Sent: Tuesday, January 26, 2021 12:56 PM
To: Iceberg Dev List <[email protected]>; Ryan Blue <[email protected]>
Cc: Gautam Kowshik <[email protected]>; Xabriel Collazo Mojica 
<[email protected]>; Grp-XAD <[email protected]>; David Wilcox 
<[email protected]>
Subject: Re: Ways To Alleviate Load For Tables With Many Snapshots

+ dawilcox

On Tue, Jan 26, 2021 at 11:46 AM Gautam <[email protected]> wrote:
Hey Ryan & David,

I believe this change from you [1] indirectly achieves this. David's issue is 
that every table.load() instantiates one FS handle for each snapshot. Your 
change converts the File reference into a location string, so this is already 
a lazy read (in a way?). The version David has been testing with predates this 
change. I believe that with the change in [1] the FS handles issue should be 
resolved.
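
To illustrate the idea described above, here is a minimal sketch of keeping 
only a location string and resolving the file handle on first use. It is not 
the code from [1]; SketchInputFile, LocationBackedSnapshot, and the opener 
function are hypothetical names chosen for illustration.

```java
import java.util.function.Function;

// Hypothetical stand-in for an expensive, filesystem-backed file handle.
final class SketchInputFile {
  final String location;

  SketchInputFile(String location) {
    this.location = location;
  }
}

// Stores only the location string; the handle is resolved on first access.
final class LocationBackedSnapshot {
  private final String manifestListLocation;
  private final Function<String, SketchInputFile> opener; // hypothetical file opener
  private SketchInputFile handle; // null until someone asks for the file

  LocationBackedSnapshot(String manifestListLocation,
                         Function<String, SketchInputFile> opener) {
    this.manifestListLocation = manifestListLocation;
    this.opener = opener;
  }

  // Cheap: returns the string without touching the filesystem.
  String manifestListLocation() {
    return manifestListLocation;
  }

  // Opens the handle at most once, and only when it is actually needed.
  synchronized SketchInputFile manifestList() {
    if (handle == null) {
      handle = opener.apply(manifestListLocation);
    }
    return handle;
  }
}
```

With this shape, loading a table with many snapshots would construct many 
small objects but open a handle only for the snapshots a reader touches.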

Please correct me if I'm wrong, David/Ryan.

thanks and regards,
-Gautam.

[1] - https://github.com/apache/iceberg/pull/1085/files

On Tue, Jan 26, 2021 at 10:55 AM Ryan Blue <[email protected]> wrote:
David,

We could probably make it so that Snapshot instances are lazily created from 
the metadata file, but that would be a fairly large change. If you're 
interested, we can definitely make it happen.

I agree with Vivekanand, though. A much easier solution is to reduce the number 
of snapshots in the table by expiring them. How long are you retaining 
snapshots?
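
In Iceberg itself, expiration goes through the table API, typically 
table.expireSnapshots().expireOlderThan(cutoffMillis).commit(). As a rough 
illustration of what such a retention policy selects, here is a self-contained 
sketch that does not depend on Iceberg. SnapshotRetention and idsToExpire are 
hypothetical names, and the model assumes a simple linear history, which is a 
simplification of what the library does.

```java
import java.util.ArrayList;
import java.util.List;

final class SnapshotRetention {
  // Each entry is {snapshotId, timestampMillis}, ordered oldest to newest.
  // Returns the IDs that are older than the cutoff, while always keeping
  // the newest `retainLast` snapshots regardless of age.
  static List<Long> idsToExpire(long[][] snapshots, long olderThanMillis, int retainLast) {
    List<Long> expired = new ArrayList<>();
    int firstRetained = snapshots.length - retainLast;
    for (int i = 0; i < firstRetained; i++) {
      if (snapshots[i][1] < olderThanMillis) {
        expired.add(snapshots[i][0]);
      }
    }
    return expired;
  }
}
```

A table with tens of thousands of snapshots and a retention window of a few 
days would see most of its history selected by a policy like this.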

rb

On Thu, Jan 21, 2021 at 8:11 PM Vivekanand Vellanki <[email protected]> wrote:
Just curious, what is the need to retain all those snapshots?

I would assume that there is a mechanism to expire snapshots and delete 
data/manifest files that are no longer required.

On Thu, Jan 21, 2021 at 11:01 PM David Wilcox <[email protected]> 
wrote:
Hi Iceberg Devs,

I have a process that reads Tables stored in Iceberg and processes them, many 
at a time. Lately, we've had problems with the scalability of our process due 
to the number of Hadoop Filesystem objects created inside Iceberg for Tables 
with many snapshots. These tables could have tens of thousands of snapshots 
inside, but I only want to read the latest snapshot. Inside the Hadoop 
Filesystem creation code that's called for every snapshot, there are 
process-level locks that end up locking up my whole process.

Inside TableMetadataParser, it looks like we read in every snapshot even though 
the reader likely only wants one snapshot. This loop is what's responsible for 
locking up my process.
https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320

I noticed that my process does not care about the whole snapshot list. It is 
only interested in one particular snapshot. I'm 
interested in making a contribution so that the entire snapshot list is lazily 
calculated inside of TableMetadata where it's actually used. So, we would not 
create the Snapshot itself in TableMetadataParser, but instead likely would 
pass a SnapshotCreator in that could know how to create snapshots. We would 
pass all of the SnapshotCreators into TableMetadata which would create 
snapshots when needed.
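
A minimal sketch of that proposal, assuming nothing about the real Iceberg 
classes: SketchSnapshot and SketchTableMetadata are hypothetical stand-ins, 
and SnapshotCreator is only the name suggested above, not an existing API. 
The parser would hand over one creator per snapshot entry, and a snapshot 
would be built only when a caller asks for it.

```java
import java.util.List;
import java.util.function.Supplier;

// Hypothetical stand-in for a snapshot; not the real Iceberg class.
final class SketchSnapshot {
  final long snapshotId;
  final String manifestListLocation;

  SketchSnapshot(long snapshotId, String manifestListLocation) {
    this.snapshotId = snapshotId;
    this.manifestListLocation = manifestListLocation;
  }
}

// Knows how to build one snapshot and builds it at most once.
final class SnapshotCreator implements Supplier<SketchSnapshot> {
  private final long snapshotId;
  private final Supplier<SketchSnapshot> factory;
  private SketchSnapshot cached; // null until first use

  SnapshotCreator(long snapshotId, Supplier<SketchSnapshot> factory) {
    this.snapshotId = snapshotId;
    this.factory = factory;
  }

  long snapshotId() {
    return snapshotId;
  }

  @Override
  public synchronized SketchSnapshot get() {
    if (cached == null) {
      cached = factory.get();
    }
    return cached;
  }
}

// Hypothetical stand-in for table metadata holding creators, not snapshots.
final class SketchTableMetadata {
  private final List<SnapshotCreator> creators;
  private final long currentSnapshotId;

  SketchTableMetadata(List<SnapshotCreator> creators, long currentSnapshotId) {
    this.creators = creators;
    this.currentSnapshotId = currentSnapshotId;
  }

  SketchSnapshot currentSnapshot() {
    return snapshot(currentSnapshotId);
  }

  // Only the matching creator is invoked; the others stay unbuilt.
  SketchSnapshot snapshot(long id) {
    for (SnapshotCreator creator : creators) {
      if (creator.snapshotId() == id) {
        return creator.get();
      }
    }
    return null;
  }
}
```

Reading only the latest snapshot from a table with tens of thousands of 
entries would then build a single snapshot object instead of all of them.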

Would you be amenable to such a change? I want to make sure that you think that 
this sounds like something you would accept before I spend time coding it up.

Any other thoughts on this?

Thanks,
David Wilcox

--
Ryan Blue
Software Engineer
Netflix
