I created #3697 <https://github.com/apache/iceberg/pull/3697> that gives an example of how subclasses of *FileScanTask* can be converted to Immutables.
On Thu, Dec 9, 2021 at 1:53 AM Jack Ye <yezhao...@gmail.com> wrote: > For JSON serialization of objects such as FileScanTask, I have been > experimenting about the best way to achieve that. So far the best way I see > is actually still to use a fully customized JSON parser like what we did > for TableMetadata. This is because there are many sub-classes of the > FileScanTask interface, and serialization has to be defined all the way > down to some very nested objects like Expression. I don't see how that > could be easily done by Immutables or Jackson annotations, but I am not > very experienced in Java annotation processor frameworks. I can publish a > PR for more discussions around that area. > > -Jack > > On Wed, Dec 8, 2021 at 3:08 PM Ryan Blue <b...@tabular.io> wrote: > >> This could be valuable for some things, but I'm skeptical that we would >> want to rewrite significant pieces of existing code this way. >> >> For example, you mentioned being able to "get rid of hand-written JSON >> serializers/deserializers". Automatic serialization is fiddly. You end up >> needing to scatter customizations around the code to annotate fields and >> add conversions. Then those annotations become the source of truth for >> serialization. If someone doesn't know what annotation to add to a field or >> just doesn't consider serialization you can serialize either more than you >> intended or something incorrect. This doesn't fit well with Iceberg, where >> the serialization to JSON is documented in the spec and is strict. >> >> In addition to making it hard to maintain a JSON format that aligns with >> the spec, this change isn't really needed. The existing parsers are already >> written and don't change much. We don't have to worry about updating >> dependencies and the implications to runtime Jars. But this would introduce >> new (and unnecessary) risk to integrations. >> >> To make that risk more concrete, consider object serialization in Spark. >> Today, we support Java, JSON, and Kryo serialization. There are places >> where we've needed to adapt the implementations and stop storing immutable >> collections for Kryo, or else Kryo can't reconstruct objects. So instead of >> storing an ImmutableMap, we store a HashMap and wrap it to be unmodifiable. >> If we were to rewrite classes that get serialized, we'd need to ensure >> we're not causing regressions where classes become incompatible with >> serialization that currently works. >> >> I don't think I would opt to spend time converting existing classes that >> we already know work, when those already have working JSON parsers, Java >> and Kryo serialization, and builders. Introducing big changes to achieve >> the same functionality is a pretty significant risk and should have a >> bigger upside than reducing lines of code, in my opinion. >> >> There may be uses other than converting existing code and there could be >> low risk existing classes that we can convert. Using this to support JSON >> serialization for `FileScanTask` is a good example because we need JSON >> serialization for Trino, but it doesn't need to conform to a spec. >> >> Ryan >> >> On Wed, Dec 8, 2021 at 4:37 AM Eduard Tudenhoefner <edu...@dremio.com> >> wrote: >> >>> Hello everyone, >>> >>> I was curious what people think about introducing/using the Immutables >>> <https://immutables.github.io/> library in Iceberg. The Immutables >>> library (Apache License 2.0) works via annotation processing, where an >>> abstract value class is annotated with *@Value.Immutable* to generate >>> an *immutable implementation class*. >>> >>> The main motivation for the Immutables library is outlined below: >>> >>> - Use true immutable objects that are type-safe, thread-safe, >>> null-safe >>> - Get builder classes for free >>> - reduce amount of boiler-plate code >>> - Get JSON serialization/deserialization for free (this will also be >>> helpful for the REST-based Catalog >>> <https://github.com/apache/iceberg/pull/3424>) since Immutable >>> objects are serialization ready (including JSON and its binary forms) >>> - Supports lazy, derived and optional attributes >>> - Immutable objects are constructed once, in a consistent state, and >>> can be safely shared >>> - Will fail if mandatory attributes are missing >>> - Cannot be sneakily modified when passed to other code >>> - Immutable objects are naturally thread-safe and can therefore be >>> safely shared among threads >>> - No excessive copying >>> - No excessive synchronization >>> - Object definitions are pleasant to write and read >>> - No boilerplate setter and getters >>> - No IDE-generated hashCode, equals and toString methods >>> required. However, custom implementation for these methods can still >>> be >>> written >>> >>> >>> I did a small PoC in #3688 <https://github.com/apache/iceberg/pull/3688> >>> (TableIdentifier/Namespace) >>> and #3580 <https://github.com/apache/iceberg/pull/3580> (TableMetadata) >>> but notice that both PRs still don't show the full potential of Immutables >>> as they only convert objects to being Immutable + using the generated >>> builders + doing some fairly easy validation checks. >>> The real benefit gets more obvious when we can get rid of hand-written >>> JSON serializers/deserializers. >>> >>> >>> Eduard >>> >> >> >> -- >> Ryan Blue >> Tabular >> >