Great, thanks for the update! I'm glad that cleaning that up fixed the problem.
On Tue, Jan 26, 2021 at 11:46 AM Gautam <gautamkows...@gmail.com> wrote:

> Hey Ryan & David,
>
> I believe this change from you [1] indirectly achieves this. David's issue
> is that every table.load() instantiates one FS handle per snapshot, and by
> converting the File reference into a location string, your change already
> makes this a lazy read (in a way). The version David has been testing with
> predates this change. I believe that with [1], the FS-handles issue should
> be resolved.
>
> Please correct me if I'm wrong, David / Ryan.
>
> thanks and regards,
> -Gautam.
>
> [1] - https://github.com/apache/iceberg/pull/1085/files
>
> On Tue, Jan 26, 2021 at 10:55 AM Ryan Blue <rb...@netflix.com.invalid> wrote:
>
>> David,
>>
>> We could probably make it so that Snapshot instances are lazily created
>> from the metadata file, but that would be a fairly large change. If you're
>> interested, we can definitely make it happen.
>>
>> I agree with Vivekanand, though. A much easier solution is to reduce the
>> number of snapshots in the table by expiring them. How long are you
>> retaining snapshots?
>>
>> rb
>>
>> On Thu, Jan 21, 2021 at 8:11 PM Vivekanand Vellanki <vi...@dremio.com> wrote:
>>
>>> Just curious, what is the need to retain all those snapshots?
>>>
>>> I would assume that there is a mechanism to expire snapshots and delete
>>> data/manifest files that are no longer required.
>>>
>>> On Thu, Jan 21, 2021 at 11:01 PM David Wilcox <dawil...@adobe.com.invalid> wrote:
>>>
>>>> Hi Iceberg Devs,
>>>>
>>>> I have a process that reads Tables stored in Iceberg and processes
>>>> them, many at a time. Lately, we've had scalability problems in this
>>>> process because of the number of Hadoop FileSystem objects created
>>>> inside Iceberg for tables with many snapshots. These tables can have
>>>> tens of thousands of snapshots, but I only want to read the latest
>>>> one. The Hadoop FileSystem creation code, which is called for every
>>>> snapshot, takes process-level locks that end up locking up my whole
>>>> process.
>>>>
>>>> Inside TableMetadataParser, it looks like we read in every snapshot
>>>> even though the reader likely only wants one. This loop is what's
>>>> responsible for locking up my process:
>>>>
>>>> https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320
>>>>
>>>> My process does not care about the whole snapshot list; it is only
>>>> interested in one particular snapshot. I'm interested in contributing
>>>> a change so that the snapshot list is lazily materialized inside
>>>> TableMetadata, where it's actually used. Instead of creating each
>>>> Snapshot in TableMetadataParser, we would pass in something like a
>>>> SnapshotCreator that knows how to create snapshots, hand all of the
>>>> SnapshotCreators to TableMetadata, and have it create Snapshot objects
>>>> only when needed.
>>>>
>>>> Would you be amenable to such a change? I want to make sure this
>>>> sounds like something you would accept before I spend time coding it
>>>> up.
>>>>
>>>> Any other thoughts on this?
>>>>
>>>> Thanks,
>>>> David Wilcox
>>>
>>
>> --
>> Ryan Blue
>> Software Engineer
>> Netflix
>

--
Ryan Blue
Software Engineer
Netflix
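
[Editor's note] For readers landing on this thread: a minimal sketch of the snapshot-expiration approach Ryan suggests, using Iceberg's ExpireSnapshots API. The catalog setup, warehouse path, table name, and retention window below are assumptions for illustration, not details from the thread.

    import java.util.concurrent.TimeUnit;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.iceberg.Table;
    import org.apache.iceberg.catalog.TableIdentifier;
    import org.apache.iceberg.hadoop.HadoopCatalog;

    public class ExpireOldSnapshots {
      public static void main(String[] args) {
        // Hypothetical warehouse location and table name.
        HadoopCatalog catalog =
            new HadoopCatalog(new Configuration(), "hdfs://warehouse/path");
        Table table = catalog.loadTable(TableIdentifier.of("db", "events"));

        // Expire snapshots older than one week, but always retain the
        // last 10 so recent history stays available for time travel.
        long weekAgoMillis =
            System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
        table.expireSnapshots()
            .expireOlderThan(weekAgoMillis)
            .retainLast(10)
            .commit();
      }
    }

Run periodically, this keeps the snapshot list in the metadata file small, so table.load() has far fewer snapshots to parse in the first place.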
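[Editor's note] And a rough sketch of the lazy-snapshot idea David describes: keep the raw JSON for each snapshot and defer parsing (and any FileSystem access) until a snapshot is first requested. This is an illustration only, not Iceberg's actual implementation; the Snapshot placeholder and helper names are made up.

    import java.util.List;
    import java.util.function.Supplier;
    import java.util.stream.Collectors;

    class LazySnapshots {
      // Placeholder standing in for Iceberg's parsed Snapshot object.
      interface Snapshot {}

      // Wraps a parser call so no parsing or FileSystem work happens
      // until get() is first called; the result is cached afterward.
      static <T> Supplier<T> memoize(Supplier<T> delegate) {
        return new Supplier<T>() {
          private T value;
          private boolean done;

          @Override
          public synchronized T get() {
            if (!done) {
              value = delegate.get();
              done = true;
            }
            return value;
          }
        };
      }

      // Instead of eagerly parsing every snapshot (as in the loop David
      // links in TableMetadataParser), hand back deferred suppliers.
      static List<Supplier<Snapshot>> fromJson(List<String> snapshotJsonNodes) {
        return snapshotJsonNodes.stream()
            .map(json -> memoize(() -> parseSnapshot(json)))
            .collect(Collectors.toList());
      }

      private static Snapshot parseSnapshot(String json) {
        // The real parser would build a Snapshot from this JSON node.
        return new Snapshot() {};
      }
    }

With this shape, a reader that wants only the current snapshot calls get() on a single supplier, and the tens of thousands of other snapshots are never parsed.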