I created #3697 <https://github.com/apache/iceberg/pull/3697> that gives an
example of how subclasses of *FileScanTask* can be converted to Immutables.

On Thu, Dec 9, 2021 at 1:53 AM Jack Ye <yezhao...@gmail.com> wrote:

> For JSON serialization of objects such as FileScanTask, I have been
> experimenting about the best way to achieve that. So far the best way I see
> is actually still to use a fully customized JSON parser like what we did
> for TableMetadata. This is because there are many sub-classes of the
> FileScanTask interface, and serialization has to be defined all the way
> down to some very nested objects like Expression. I don't see how that
> could be easily done by Immutables or Jackson annotations, but I am not
> very experienced in Java annotation processor frameworks. I can publish a
> PR for more discussions around that area.
>
> -Jack
>
> On Wed, Dec 8, 2021 at 3:08 PM Ryan Blue <b...@tabular.io> wrote:
>
>> This could be valuable for some things, but I'm skeptical that we would
>> want to rewrite significant pieces of existing code this way.
>>
>> For example, you mentioned being able to "get rid of hand-written JSON
>> serializers/deserializers". Automatic serialization is fiddly. You end up
>> needing to scatter customizations around the code to annotate fields and
>> add conversions. Then those annotations become the source of truth for
>> serialization. If someone doesn't know what annotation to add to a field or
>> just doesn't consider serialization you can serialize either more than you
>> intended or something incorrect. This doesn't fit well with Iceberg, where
>> the serialization to JSON is documented in the spec and is strict.
>>
>> In addition to making it hard to maintain a JSON format that aligns with
>> the spec, this change isn't really needed. The existing parsers are already
>> written and don't change much. We don't have to worry about updating
>> dependencies and the implications to runtime Jars. But this would introduce
>> new (and unnecessary) risk to integrations.
>>
>> To make that risk more concrete, consider object serialization in Spark.
>> Today, we support Java, JSON, and Kryo serialization. There are places
>> where we've needed to adapt the implementations and stop storing immutable
>> collections for Kryo, or else Kryo can't reconstruct objects. So instead of
>> storing an ImmutableMap, we store a HashMap and wrap it to be unmodifiable.
>> If we were to rewrite classes that get serialized, we'd need to ensure
>> we're not causing regressions where classes become incompatible with
>> serialization that currently works.
>>
>> I don't think I would opt to spend time converting existing classes that
>> we already know work, when those already have working JSON parsers, Java
>> and Kryo serialization, and builders. Introducing big changes to achieve
>> the same functionality is a pretty significant risk and should have a
>> bigger upside than reducing lines of code, in my opinion.
>>
>> There may be uses other than converting existing code and there could be
>> low risk existing classes that we can convert. Using this to support JSON
>> serialization for `FileScanTask` is a good example because we need JSON
>> serialization for Trino, but it doesn't need to conform to a spec.
>>
>> Ryan
>>
>> On Wed, Dec 8, 2021 at 4:37 AM Eduard Tudenhoefner <edu...@dremio.com>
>> wrote:
>>
>>> Hello everyone,
>>>
>>> I was curious what people think about introducing/using the Immutables
>>> <https://immutables.github.io/> library in Iceberg. The Immutables
>>> library (Apache License 2.0) works via annotation processing, where an
>>> abstract value class is annotated with *@Value.Immutable* to generate
>>> an *immutable implementation class*.
>>>
>>> The main motivation for the Immutables library is outlined below:
>>>
>>>    - Use true immutable objects that are type-safe, thread-safe,
>>>    null-safe
>>>    - Get builder classes for free
>>>    - reduce amount of boiler-plate code
>>>    - Get JSON serialization/deserialization for free (this will also be
>>>    helpful for the REST-based Catalog
>>>    <https://github.com/apache/iceberg/pull/3424>) since Immutable
>>>    objects are serialization ready (including JSON and its binary forms)
>>>    - Supports lazy, derived and optional attributes
>>>    - Immutable objects are constructed once, in a consistent state, and
>>>    can be safely shared
>>>       - Will fail if mandatory attributes are missing
>>>       - Cannot be sneakily modified when passed to other code
>>>    - Immutable objects are naturally thread-safe and can therefore be
>>>    safely shared among threads
>>>       - No excessive copying
>>>       - No excessive synchronization
>>>    - Object definitions are pleasant to write and read
>>>       - No boilerplate setter and getters
>>>       - No IDE-generated hashCode, equals and toString methods
>>>       required. However, custom implementation for these methods can still 
>>> be
>>>       written
>>>
>>>
>>> I did a small PoC in #3688 <https://github.com/apache/iceberg/pull/3688> 
>>> (TableIdentifier/Namespace)
>>> and #3580 <https://github.com/apache/iceberg/pull/3580> (TableMetadata)
>>> but notice that both PRs still don't show the full potential of Immutables
>>> as they only convert objects to being Immutable + using the generated
>>> builders + doing some fairly easy validation checks.
>>> The real benefit gets more obvious when we can get rid of hand-written
>>> JSON serializers/deserializers.
>>>
>>>
>>> Eduard
>>>
>>
>>
>> --
>> Ryan Blue
>> Tabular
>>
>

Reply via email to