Hi all,
I've been thinking about how we could make Iceberg tables more
performant for streaming inserts. One idea is to use the manifests
field as a buffer for manifest files before they are written to the
manifest list. This would reduce write amplification and simplify
conflict resolution for concurrent writes.
I've written down my proposal here:
https://lists.apache.org/thread/4cm9kc6pkmx5ol218z5yjk41gh9t28qg
I thought I'd share it with you before you decide to deprecate the
manifests field.
Kind regards,
Jan
On 22.11.24 11:55, Fokko Driesprong wrote:
Hey Ryan,
The goal of the deprecation is to keep other implementations from
producing it. PyIceberg, for example, does not support this, and I
think it would be good to avoid having others (Rust, Go, etc.) support
it. Regarding the removal, Amogh expressed the same concern on the
PR <https://github.com/apache/iceberg/pull/11586#discussion_r1848789823>.
In my quest to make the Java implementation follow the spec as closely
as possible, I noticed that we use a DummyFileIO to mimic a
ManifestList. I ran into this when turning 503: added_snapshot_id into
a required field
<https://github.com/apache/iceberg/pull/11626/files#r1853683623>. So
the value is in removing code paths, as Szehon pointed out. Once we
remove support for the embedded manifest list, we can remove all that
logic and keep the codebase nice and tidy.
It would be good to start the discussion about deprecating support for
older format versions at some point. However, for a V2 reader it is
fairly easy to project V1 metadata as V2, except when embedded
manifests are being used. I think marking this kind of oddity as
deprecated will enable readers to support reading older versions for
longer. My suggestion would be to mark the field as deprecated and
revisit the actual removal later. I've marked it for removal in Java
2.0 for now to give it enough time.
Kind regards,
Fokko
On Thu, 21 Nov 2024 at 20:52, rdb...@gmail.com <rdb...@gmail.com> wrote:
Can we safely deprecate and remove this? The manifest list is
required in v2, but the spec has stated for a long time that v1
tables can use |manifests| rather than a manifest list. It’s
unlikely, but it would be valid for other implementations to
produce it.
I would understand if other implementations chose to fail tables
that don’t have a manifest list to avoid adding code to handle
|manifests|, but I don’t think that there’s much value in removing
support from the Java implementation.
Instead, what about discussing how to deprecate support for older
format versions? That seems like the main issue here. Once the
majority of implementations move to newer versions, we would like
to deprecate the old ones.
On Thu, Nov 21, 2024 at 11:01 AM Szehon Ho
<szehon.apa...@gmail.com> wrote:
+1, great to have fewer possible paths.
Thanks
Szehon
On Thu, Nov 21, 2024 at 10:33 AM Steve Zhang
<hongyue_zh...@apple.com.invalid> wrote:
+1 to deprecate
Thanks,
Steve Zhang
On Nov 19, 2024, at 3:32 AM, Fokko Driesprong
<fo...@apache.org> wrote:
Hi everyone,
I would like to propose deprecating embedded manifests
<https://github.com/apache/iceberg/pull/11586>. They were used
before the manifest list was introduced, but I don't think they
have been used since the project was open-sourced, and it would
be good to officially deprecate them in the spec. They are only
supported by Iceberg Java today, and I haven't seen any
requests for PyIceberg to add support for them.
Any questions or concerns about deprecating the embedded
manifests?
Kind regards,
Fokko Driesprong