Hi Erik,

Manifest lists serve two purposes:

   1. Reduce the amount of data tracked by the root metadata file
   2. Provide a rough index over manifest files to cut down on planning time

Manifests are reused across snapshots to cut down on the amount of work
required in a commit, but that reuse leaves us with a large number of
manifests. Listing them all gets expensive if the list is kept in the root
metadata, which includes every valid snapshot. Moving the list to its own
file allows Iceberg to avoid reading it unless it is used, and to avoid
re-writing it for every valid snapshot.
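
As a rough illustration of that layout (the class, function, and file names
here are made up for the example and are not Iceberg's API), the root
metadata only needs one pointer per snapshot, and a list file is read only
for the snapshot that is actually being scanned:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: int
    # Pointer only: the manifest paths themselves live in this separate file.
    manifest_list_path: str


@dataclass
class TableMetadata:
    # The root metadata keeps one small entry per valid snapshot.
    snapshots: List[Snapshot]

    def manifests_for(
        self, snapshot_id: int, read_list: Callable[[str], List[str]]
    ) -> List[str]:
        """Load the manifest list for a single snapshot, on demand."""
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return read_list(snap.manifest_list_path)
        raise KeyError(snapshot_id)


# Hypothetical stand-in for a file store: list path -> manifest paths.
FILES: Dict[str, List[str]] = {
    "snap-1.list": ["m1.avro", "m2.avro"],
    "snap-2.list": ["m1.avro", "m2.avro", "m3.avro"],
}
READS: List[str] = []  # records which list files were actually opened


def read_list(path: str) -> List[str]:
    READS.append(path)
    return FILES[path]


metadata = TableMetadata([Snapshot(1, "snap-1.list"), Snapshot(2, "snap-2.list")])
current = metadata.manifests_for(2, read_list)
```

In this sketch, scanning snapshot 2 opens only snap-2.list; the list for
snapshot 1 is never read, and m1.avro and m2.avro are shared between the two
snapshots rather than rewritten.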

As long as the list is written to its own file, we may as well write a
summary of the partition values covered by each manifest so that we can
skip manifests that don’t match a query. That’s where the rough index comes from, and it
really does speed up queries. In fact, we have a new PR out to rewrite
manifests to take advantage of this:
https://github.com/apache/incubator-iceberg/pull/200/files
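
To make the rough index concrete, here is a minimal sketch. It assumes each
entry in the list carries a lower and upper bound for a single partition
field, stored here as ISO date strings; the names are hypothetical and this
is not the actual Iceberg implementation:

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ManifestSummary:
    path: str
    lower: str  # smallest partition value in the manifest (assumed ISO date)
    upper: str  # largest partition value in the manifest (assumed ISO date)


def manifests_to_scan(
    manifest_list: List[ManifestSummary], lo: str, hi: str
) -> List[str]:
    """Keep only manifests whose partition range overlaps [lo, hi]."""
    return [m.path for m in manifest_list if m.upper >= lo and m.lower <= hi]


manifest_list = [
    ManifestSummary("m-a.avro", "2019-05-01", "2019-05-31"),
    ManifestSummary("m-b.avro", "2019-06-01", "2019-06-03"),
    ManifestSummary("m-c.avro", "2019-04-01", "2019-04-30"),
]

# A query restricted to June 2019 only needs to open one manifest.
selected = manifests_to_scan(manifest_list, "2019-06-01", "2019-06-30")
```

The tighter each manifest's partition range is, the more manifests a filter
like this can skip, which is why rewriting manifests to group similar
partitions helps planning.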

Does that answer your question?

On Mon, Jun 3, 2019 at 1:38 PM Erik Wright <erik.wri...@shopify.com.invalid>
wrote:

> In the process of following up on the "Updates/Deletes/Upserts" thread,
> I'm re-reading the table spec. I have a question about Manifest List files.
>
> If I understand correctly, the manifest list files are separate files that
> are created prior to attempting to commit a new snapshot. Each snapshot may
> have a single manifest list file. The manifest list file references _all_
> manifest files included in the snapshot.
>
> During a commit collision, two writers will produce new manifest list
> files. Assuming the two writes are compatible (one is append, one is
> replace, for example) the loser should be able to re-process their commit
> without rewriting any data files but will, nonetheless, need to rewrite
> their manifest list file in addition to rewriting their snapshot file.
>
> I was under the impression that it was a design objective to minimize the
> amount of work required in order to retry a commit. The inability to
> compose multiple manifest list files together seems like it adds mandatory
> read and write steps to almost every commit collision.
>
> Can someone clarify what the philosophy is with regards to minimizing the
> cost of commit retries?
>
> Thanks!
>
> -Erik
>


-- 
Ryan Blue
Software Engineer
Netflix
