Sure, I guessed you were asking about the number of manifest files rather than entries. There's always a tradeoff, some aspects being:
- More manifest files => better predicate pushdown (more manifest files can be skipped during a query) and less chance of a concurrency conflict (two transactions trying to modify the same manifest file, which leads to a retry).
- Fewer manifest files => metadata queries (like 'show partitions') can be faster.

Each of these is a large topic in itself that might be too big to go into here :) For us, the benefit of more manifest files is not as important as making metadata queries fast for our users, so we have tuned commit.manifest.target-size-bytes to a few times the default. We try to keep the manifest file count in the tens or hundreds for any table; we find that if there are thousands, a 'show partitions' query takes a long time.

We do need to run RewriteManifests periodically to keep the table in this shape (as we have too many commits), and we also use 'commit.manifest.min-count-to-merge' and 'commit.manifest-merge.enabled' to merge manifests on commit.

Hope that helps,
Szehon

On Fri, Jan 7, 2022 at 1:10 PM g. g. grey <g.g.g...@gmail.com> wrote:

> Hi Szehon,
>
> Thanks. My apologies; I was too loose in my wording. I'll try to use the
> terms from the spec.
>
> I was asking about the number of total manifest files, specifically the
> number of `manifest_file` structs that are found in the manifest-list file.
>
> It sounds like the "commit.manifest.target-size-bytes" controls the target
> size when we merge small manifest files, which is great to know we can
> configure, as it will clearly have an impact on the number of
> `manifest_file` structs.
>
> Is there a general order-of-magnitude target number of `manifest_file`
> structs? Presumably that would dictate when one would want to merge
> manifest files and/or data files.
>
> Thanks again!
> ggg
>
>
> On Fri, Jan 7, 2022 at 11:41 AM Szehon Ho <szehon.apa...@gmail.com> wrote:
>
>> Hi,
>>
>> The manifest entries are one per data file or delete file, so depends how
>> many data files/delete files your table has. Number of files is controlled
>> mostly by the parallelism of the job that writes the table, though there
>> are Iceberg RewriteDataFile utilities that can compact as well (as in your
>> link).
>>
>> The number of manifest files is another topic, controlled by
>> "commit.manifest.target-size-bytes"
>> (but should not affect the number of total manifest entries).
>>
>> Hope that helps,
>> Szehon
>>
>> On Fri, Jan 7, 2022 at 9:39 AM g. g. grey <g.g.g...@gmail.com> wrote:
>>
>>> Hi folks,
>>>
>>> I am just getting started with Iceberg and I'm trying to build up some
>>> intuition for how large the metadata will become for large, active tables.
>>> Specifically, what is the order of magnitude of manifest entries that I
>>> should reasonably expect in a manifest-list file? Is there a particular
>>> range that is ideal and aimed for when cleaning up/maintaining a table?
>>>
>>> I found the maintenance page <https://iceberg.apache.org/#maintenance/>,
>>> but I'm hoping to find rules-of-thumb based on peoples' experience with
>>> using iceberg.
>>>
>>> Thanks! If I've missed the info somewhere, a simple pointer would be
>>> great.
>>> ggg
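
For reference, here is a rough sketch of how the three table properties named in the reply at the top of this thread might be set, plus a periodic manifest rewrite. It only builds Spark SQL strings, so it can be read without a cluster. Several things here are assumptions and not from the thread: the table name `db.events`, the catalog name `my_catalog`, the multiplier of 4 (the reply only says "a few times" the default), the min-count value of 100, and the choice of Spark SQL with Iceberg's `rewrite_manifests` procedure as the way to run RewriteManifests. The 8 MB default is taken from the Iceberg table-properties documentation; please check it against your version.

```python
# Documented Iceberg default for commit.manifest.target-size-bytes (8 MB).
DEFAULT_TARGET_SIZE_BYTES = 8 * 1024 * 1024


def manifest_tuning_sql(table: str, size_multiplier: int = 4,
                        min_count_to_merge: int = 100) -> str:
    """Build an ALTER TABLE statement for the properties named in the thread.

    size_multiplier and min_count_to_merge are illustrative values only.
    """
    props = {
        # Larger target size => fewer, bigger manifest files.
        "commit.manifest.target-size-bytes":
            str(DEFAULT_TARGET_SIZE_BYTES * size_multiplier),
        # Number of manifests that must accumulate before merge-on-commit runs.
        "commit.manifest.min-count-to-merge": str(min_count_to_merge),
        # Merge small manifests automatically as part of each commit.
        "commit.manifest-merge.enabled": "true",
    }
    body = ",\n".join(f"  '{key}' = '{value}'" for key, value in props.items())
    return f"ALTER TABLE {table} SET TBLPROPERTIES (\n{body}\n)"


def rewrite_manifests_sql(catalog: str, table: str) -> str:
    """Build a call to Iceberg's rewrite_manifests Spark procedure."""
    return f"CALL {catalog}.system.rewrite_manifests('{table}')"


if __name__ == "__main__":
    # Hypothetical names; substitute your own catalog and table.
    print(manifest_tuning_sql("db.events"))
    print(rewrite_manifests_sql("my_catalog", "db.events"))
```

The generated statements would be passed to something like `spark.sql(...)` in a session with the Iceberg SQL extensions enabled. Whether a larger target size is the right direction depends on the tradeoff described in the reply: it favors fast metadata queries over manifest skipping and commit concurrency.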