Just curious, what is the need to retain all those snapshots? I would assume that there is a mechanism to expire snapshots and delete data/manifest files that are no longer required.
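For example, the table API's ExpireSnapshots action does exactly that. A rough sketch (assuming `table` is an already-loaded org.apache.iceberg.Table; the retention settings here are made up):

    import java.util.concurrent.TimeUnit;
    import org.apache.iceberg.Table;

    // ExpireSnapshots removes expired snapshots from table metadata and
    // deletes data/manifest files that are no longer reachable from any
    // retained snapshot.
    void expireOldSnapshots(Table table) {
      long cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(7);
      table.expireSnapshots()
          .expireOlderThan(cutoff)  // drop snapshots older than 7 days...
          .retainLast(100)          // ...but always keep the last 100
          .commit();
    }

Running this periodically would keep the snapshot list from ever reaching tens of thousands of entries.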
On Thu, Jan 21, 2021 at 11:01 PM David Wilcox <dawil...@adobe.com.invalid> wrote:

> Hi Iceberg Devs,
>
> I have a process that reads Tables stored in Iceberg and processes them,
> many at a time. Lately, we've had problems with the scalability of our
> process due to the number of Hadoop FileSystem objects created inside
> Iceberg for Tables with many snapshots. These tables can have tens of
> thousands of snapshots, but I only want to read the latest one. The Hadoop
> FileSystem creation code, which is called for every snapshot, takes
> process-level locks that end up locking up my whole process.
>
> Inside TableMetadataParser, it looks like we read in every snapshot even
> though the reader likely only wants one. This loop is what's responsible
> for locking up my process:
>
> https://github.com/apache/iceberg/blob/330f1520ce497153f7a6e9a80a22035ff9f6aa32/core/src/main/java/org/apache/iceberg/TableMetadataParser.java#L320
>
> My process does not care about the whole snapshot list; it is only
> interested in one particular snapshot. I'd like to contribute a change so
> that the snapshot list is lazily materialized inside TableMetadata, where
> it is actually used. Rather than creating each Snapshot in
> TableMetadataParser, the parser would build SnapshotCreators that know how
> to create snapshots, and TableMetadata would invoke them only when a
> snapshot is actually needed.
>
> Would you be amenable to such a change? I want to make sure this sounds
> like something you would accept before I spend time coding it up.
>
> Any other thoughts on this?
>
> Thanks,
> David Wilcox
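To make the proposal concrete, I'd picture something along these lines (a sketch only, not actual Iceberg internals; LazyTableMetadata and the Supplier-based creator map are placeholders for whatever shape the real change takes):

    import java.util.Map;
    import java.util.function.Supplier;
    import org.apache.iceberg.Snapshot;

    // Snapshot ids are cheap to read from the metadata JSON, so the parser
    // could index creators by id without touching the FileSystem; the
    // expensive Snapshot construction runs only when get() is called.
    class LazyTableMetadata {
      private final long currentSnapshotId;
      private final Map<Long, Supplier<Snapshot>> snapshotCreators;

      LazyTableMetadata(long currentSnapshotId,
                        Map<Long, Supplier<Snapshot>> snapshotCreators) {
        this.currentSnapshotId = currentSnapshotId;
        this.snapshotCreators = snapshotCreators;
      }

      Snapshot currentSnapshot() {
        // Only the requested snapshot is materialized; the other tens of
        // thousands of entries in the metadata file stay unparsed.
        return snapshotCreators.get(currentSnapshotId).get();
      }
    }

Memoizing each supplier (e.g. with something like Guava's Suppliers.memoize) would keep repeated lookups from re-parsing the same snapshot.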