David,

We could probably make it so that Snapshot instances are lazily created from the metadata file, but that would be a fairly large change. If you're interested, we can definitely make it happen.
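
To make the lazy idea concrete, here is a rough, self-contained sketch. It is not Iceberg's actual code: `SnapshotStub`, `LazySnapshot`, `LazyTableMetadata`, and `LazySnapshotsDemo` are hypothetical names invented for illustration, and the stub stands in for the real `Snapshot` class. The parser would register one cheap creator per snapshot entry, and the expensive object would only be built the first time someone asks for it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical stand-in for a snapshot; NOT org.apache.iceberg.Snapshot.
final class SnapshotStub {
  static int created = 0; // counts how many instances were actually constructed
  final long snapshotId;
  final String manifestList;

  SnapshotStub(long snapshotId, String manifestList) {
    this.snapshotId = snapshotId;
    this.manifestList = manifestList;
    created++;
  }
}

// Hypothetical "SnapshotCreator": defers construction until first use, then caches.
final class LazySnapshot {
  final long snapshotId;
  private final Supplier<SnapshotStub> creator;
  private SnapshotStub cached;

  LazySnapshot(long snapshotId, Supplier<SnapshotStub> creator) {
    this.snapshotId = snapshotId;
    this.creator = creator;
  }

  synchronized SnapshotStub get() {
    if (cached == null) {
      cached = creator.get(); // the expensive work happens here, at most once
    }
    return cached;
  }
}

// Hypothetical metadata holder that keeps creators instead of built snapshots.
final class LazyTableMetadata {
  private final long currentSnapshotId;
  private final Map<Long, LazySnapshot> byId = new LinkedHashMap<>();

  LazyTableMetadata(long currentSnapshotId, List<LazySnapshot> snapshots) {
    this.currentSnapshotId = currentSnapshotId;
    for (LazySnapshot s : snapshots) {
      byId.put(s.snapshotId, s);
    }
  }

  SnapshotStub currentSnapshot() {
    return snapshot(currentSnapshotId);
  }

  SnapshotStub snapshot(long id) {
    LazySnapshot s = byId.get(id);
    return s == null ? null : s.get();
  }

  int snapshotCount() {
    return byId.size();
  }
}

final class LazySnapshotsDemo {
  // Simulates what a parser might do: register n creators without building anything.
  static LazyTableMetadata build(int n) {
    List<LazySnapshot> list = new ArrayList<>();
    for (long id = 1; id <= n; id++) {
      final long snapshotId = id;
      list.add(new LazySnapshot(
          snapshotId,
          () -> new SnapshotStub(snapshotId, "snap-" + snapshotId + ".avro")));
    }
    return new LazyTableMetadata(n, list); // assume the newest id is the current one
  }

  public static void main(String[] args) {
    LazyTableMetadata meta = build(10_000);
    SnapshotStub current = meta.currentSnapshot();
    System.out.println("current=" + current.snapshotId
        + " constructed=" + SnapshotStub.created);
  }
}
```

In this sketch a reader that only touches the current snapshot pays for one construction, no matter how many snapshots the metadata file lists. The real change would be larger, since anything that walks the full snapshot list would still force every entry to be built.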

I agree with Vivekanand, though. A much easier solution is to reduce the number of snapshots in the table by expiring them. How long are you retaining snapshots?

rb

On Thu, Jan 21, 2021 at 8:11 PM Vivekanand Vellanki <vi...@dremio.com> wrote:

> Just curious, what is the need to retain all those snapshots?
>
> I would assume that there is a mechanism to expire snapshots and delete
> data/manifest files that are no longer required.
>
> On Thu, Jan 21, 2021 at 11:01 PM David Wilcox <dawil...@adobe.com.invalid>
> wrote:
>
>> Hi Iceberg Devs,
>>
>> I have a process that reads Tables stored in Iceberg and processes them,
>> many at a time. Lately, we've had problems with the scalability of our
>> process due to the number of Hadoop Filesystem objects created inside
>> Iceberg for Tables with many snapshots. These tables could have tens of
>> thousands of snapshots inside, but I only want to read the latest snapshot.
>> Inside the Hadoop Filesystem creation code that's called for every
>> snapshot, there are process-level locks that end up locking up my whole
>> process.
>>
>> Inside TableMetadataParser, it looks like we read in every snapshot even
>> though the reader likely only wants one snapshot. This loop is what's
>> responsible for locking up my process.
>>
>> https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320
>>
>> I noticed that my process does not care about the whole snapshot list. My
>> process only is interested in a particular snapshot -- just one of them.
>> I'm interested in making a contribution so that the entire snapshot list is
>> lazily calculated inside of TableMetadata where it's actually used. So, we
>> would not create the Snapshot itself in TableMetadataParser, but instead
>> likely would pass a SnapshotCreator in that could know how to create
>> snapshots. We would pass all of the SnapshotCreators into TableMetadata
>> which would create snapshots when needed.
>>
>> Would you be amenable to such a change? I want to make sure that you
>> think that this sounds like something you would accept before I spend time
>> coding it up.
>>
>> Any other thoughts on this?
>>
>> Thanks,
>> David Wilcox
>>
>

--
Ryan Blue
Software Engineer
Netflix