My intention was not to rewrite significant pieces of existing & working code. The experiment with *TableMetadata* was for example to show an alternative approach to writing simple hand-written builders (since we get the builder for free). #3580 <https://github.com/apache/iceberg/pull/3580> just converted *TableMetadata* to be an immutable class but still using the existing *TableMetadataParser*.
I'm not saying we should rewrite existing hand-written serializers, I'm rather saying that we *could* if the serializers for example don't have any complex logic for backwards compatibility in them and don't have to adhere to a strict spec. Also my statement was rather referring to the hand-written JSON serializers/deserializers that are currently being created for the *REST-based Catalog* here <https://github.com/apache/iceberg/pull/3424/files#diff-cf0695b97e3a4acd04c24f7712a8ca35a6c7d81005a5b27f97401ba2a93952f9R54-R64>, where things like *TableIdentifier/NameSpace/PartitionSpec/SortOrder*/... need to be handled for JSON. I was talking to Kyle on Slack about this and wanted to help out with this while he's working on the other pieces of the REST-based catalog. Thus I thought it's a good idea to handle this via Immutables and get the JSON ser/de stuff for free. Thanks for mentioning the *FileScanTask*. I'll check and see what can be done there. Eduard On Thu, Dec 9, 2021 at 12:08 AM Ryan Blue <b...@tabular.io> wrote: > This could be valuable for some things, but I'm skeptical that we would > want to rewrite significant pieces of existing code this way. > > For example, you mentioned being able to "get rid of hand-written JSON > serializers/deserializers". Automatic serialization is fiddly. You end up > needing to scatter customizations around the code to annotate fields and > add conversions. Then those annotations become the source of truth for > serialization. If someone doesn't know what annotation to add to a field or > just doesn't consider serialization you can serialize either more than you > intended or something incorrect. This doesn't fit well with Iceberg, where > the serialization to JSON is documented in the spec and is strict. > > In addition to making it hard to maintain a JSON format that aligns with > the spec, this change isn't really needed. The existing parsers are already > written and don't change much. We don't have to worry about updating > dependencies and the implications to runtime Jars. But this would introduce > new (and unnecessary) risk to integrations. > > To make that risk more concrete, consider object serialization in Spark. > Today, we support Java, JSON, and Kryo serialization. There are places > where we've needed to adapt the implementations and stop storing immutable > collections for Kryo, or else Kryo can't reconstruct objects. So instead of > storing an ImmutableMap, we store a HashMap and wrap it to be unmodifiable. > If we were to rewrite classes that get serialized, we'd need to ensure > we're not causing regressions where classes become incompatible with > serialization that currently works. > > I don't think I would opt to spend time converting existing classes that > we already know work, when those already have working JSON parsers, Java > and Kryo serialization, and builders. Introducing big changes to achieve > the same functionality is a pretty significant risk and should have a > bigger upside than reducing lines of code, in my opinion. > > There may be uses other than converting existing code and there could be > low risk existing classes that we can convert. Using this to support JSON > serialization for `FileScanTask` is a good example because we need JSON > serialization for Trino, but it doesn't need to conform to a spec. > > Ryan > > On Wed, Dec 8, 2021 at 4:37 AM Eduard Tudenhoefner <edu...@dremio.com> > wrote: > >> Hello everyone, >> >> I was curious what people think about introducing/using the Immutables >> <https://immutables.github.io/> library in Iceberg. The Immutables >> library (Apache License 2.0) works via annotation processing, where an >> abstract value class is annotated with *@Value.Immutable* to generate an >> *immutable >> implementation class*. >> >> The main motivation for the Immutables library is outlined below: >> >> - Use true immutable objects that are type-safe, thread-safe, >> null-safe >> - Get builder classes for free >> - reduce amount of boiler-plate code >> - Get JSON serialization/deserialization for free (this will also be >> helpful for the REST-based Catalog >> <https://github.com/apache/iceberg/pull/3424>) since Immutable >> objects are serialization ready (including JSON and its binary forms) >> - Supports lazy, derived and optional attributes >> - Immutable objects are constructed once, in a consistent state, and >> can be safely shared >> - Will fail if mandatory attributes are missing >> - Cannot be sneakily modified when passed to other code >> - Immutable objects are naturally thread-safe and can therefore be >> safely shared among threads >> - No excessive copying >> - No excessive synchronization >> - Object definitions are pleasant to write and read >> - No boilerplate setter and getters >> - No IDE-generated hashCode, equals and toString methods required. >> However, custom implementation for these methods can still be written >> >> >> I did a small PoC in #3688 <https://github.com/apache/iceberg/pull/3688> >> (TableIdentifier/Namespace) >> and #3580 <https://github.com/apache/iceberg/pull/3580> (TableMetadata) >> but notice that both PRs still don't show the full potential of Immutables >> as they only convert objects to being Immutable + using the generated >> builders + doing some fairly easy validation checks. >> The real benefit gets more obvious when we can get rid of hand-written >> JSON serializers/deserializers. >> >> >> Eduard >> > > > -- > Ryan Blue > Tabular >