Hi all,

I've been thinking about how we could make Iceberg tables more performant for streaming inserts. One idea is to use the manifests field as a buffer for manifest files before they are written to the manifest-list. This would reduce write amplification and simplify conflict resolution for concurrent writes.

I've written down my proposal here: https://lists.apache.org/thread/4cm9kc6pkmx5ol218z5yjk41gh9t28qg

I thought I'd share it with you before you decide to deprecate the manifests field.

Kind regards,

Jan

On 22.11.24 11:55, Fokko Driesprong wrote:
Hey Ryan,

The goal of the deprecation is to keep other implementations from producing it. PyIceberg, for example, does not support this, and I think it would be good to avoid requiring others (Rust, Go, etc.) to support it. Regarding the removal, Amogh expressed the same concern on the PR <https://github.com/apache/iceberg/pull/11586#discussion_r1848789823>.

In my quest to make the Java implementation follow the spec as closely as possible, I noticed that we use a DummyFileIO to mimic a ManifestList. I ran into this when turning 503: added_snapshot_id into a required field <https://github.com/apache/iceberg/pull/11626/files#r1853683623>. So the value is in removing paths, as Szehon pointed out. When removing support for the embedded manifest list, we can remove all that logic and keep the codebase nice and tidy.

It would be good to start the discussion of deprecating support for older formats at some point. However, for a V2 reader it is fairly easy to project V1 metadata as V2, except when embedded manifests are being used. Marking these kinds of oddities as deprecated will, I think, enable readers to support reading older versions for a longer time. My suggestion would be to mark the field as deprecated and revisit the actual removal later. I've marked it for removal in Java 2.0 for now to give it enough time.

Kind regards,
Fokko



Op do 21 nov 2024 om 20:52 schreef rdb...@gmail.com <rdb...@gmail.com>:

    Can we safely deprecate and remove this? The manifest list is
    required in v2, but the spec has stated for a long time that v1
    tables can use |manifests| rather than a manifest list. It’s
    unlikely, but it would be valid for other implementations to
    produce it.

    I would understand if other implementations chose to fail tables
    that don’t have a manifest list to avoid adding code to handle
    |manifests|, but I don’t think that there’s much value in removing
    support from the Java implementation.

    Instead, what about discussing how to deprecate support for older
    format versions? That seems like the main issue here. Once the
    majority of implementations move to newer versions, we would like
    to deprecate the old ones.


    On Thu, Nov 21, 2024 at 11:01 AM Szehon Ho
    <szehon.apa...@gmail.com> wrote:

        +1, great to have fewer possible paths.

        Thanks
        Szehon

        On Thu, Nov 21, 2024 at 10:33 AM Steve Zhang
        <hongyue_zh...@apple.com.invalid> wrote:

            +1 to deprecate

            Thanks,
            Steve Zhang



            On Nov 19, 2024, at 3:32 AM, Fokko Driesprong
            <fo...@apache.org> wrote:

            Hi everyone,

            I would like to propose deprecating embedded manifests
            <https://github.com/apache/iceberg/pull/11586>. These were
            used before the manifest-list was introduced, but I don't
            think they have been used since the project was
            open-sourced, and it would be good to officially deprecate
            them from the spec. They are only supported by Iceberg Java
            today, and I haven't seen any requests for PyIceberg to add
            support for this.

            Any questions or concerns about deprecating the embedded
            manifests?

            Kind regards,
            Fokko Driesprong
