My intention was not to rewrite significant pieces of existing & working
code. The experiment with *TableMetadata* was for example to show an
alternative approach to writing simple hand-written builders (since we get
the builder for free). #3580 <https://github.com/apache/iceberg/pull/3580>
just converted *TableMetadata* to be an immutable class but still using the
existing *TableMetadataParser*.

I'm not saying we should rewrite existing hand-written serializers, I'm
rather saying that we *could* if the serializers for example don't have any
complex logic for backwards compatibility in them and don't have to adhere
to a strict spec.
Also my statement was rather referring to the hand-written JSON
serializers/deserializers that are currently being created for the *REST-based
Catalog* here
<https://github.com/apache/iceberg/pull/3424/files#diff-cf0695b97e3a4acd04c24f7712a8ca35a6c7d81005a5b27f97401ba2a93952f9R54-R64>,
where things like *TableIdentifier/NameSpace/PartitionSpec/SortOrder*/...
need to be handled for JSON.
I was talking to Kyle on Slack about this and wanted to help out with this
while he's working on the other pieces of the REST-based catalog. Thus I
thought it's a good idea to handle this via Immutables and get the JSON
ser/de stuff for free.

Thanks for mentioning the *FileScanTask*. I'll check and see what can be
done there.

Eduard

On Thu, Dec 9, 2021 at 12:08 AM Ryan Blue <b...@tabular.io> wrote:

> This could be valuable for some things, but I'm skeptical that we would
> want to rewrite significant pieces of existing code this way.
>
> For example, you mentioned being able to "get rid of hand-written JSON
> serializers/deserializers". Automatic serialization is fiddly. You end up
> needing to scatter customizations around the code to annotate fields and
> add conversions. Then those annotations become the source of truth for
> serialization. If someone doesn't know what annotation to add to a field or
> just doesn't consider serialization you can serialize either more than you
> intended or something incorrect. This doesn't fit well with Iceberg, where
> the serialization to JSON is documented in the spec and is strict.
>
> In addition to making it hard to maintain a JSON format that aligns with
> the spec, this change isn't really needed. The existing parsers are already
> written and don't change much. We don't have to worry about updating
> dependencies and the implications to runtime Jars. But this would introduce
> new (and unnecessary) risk to integrations.
>
> To make that risk more concrete, consider object serialization in Spark.
> Today, we support Java, JSON, and Kryo serialization. There are places
> where we've needed to adapt the implementations and stop storing immutable
> collections for Kryo, or else Kryo can't reconstruct objects. So instead of
> storing an ImmutableMap, we store a HashMap and wrap it to be unmodifiable.
> If we were to rewrite classes that get serialized, we'd need to ensure
> we're not causing regressions where classes become incompatible with
> serialization that currently works.
>
> I don't think I would opt to spend time converting existing classes that
> we already know work, when those already have working JSON parsers, Java
> and Kryo serialization, and builders. Introducing big changes to achieve
> the same functionality is a pretty significant risk and should have a
> bigger upside than reducing lines of code, in my opinion.
>
> There may be uses other than converting existing code and there could be
> low risk existing classes that we can convert. Using this to support JSON
> serialization for `FileScanTask` is a good example because we need JSON
> serialization for Trino, but it doesn't need to conform to a spec.
>
> Ryan
>
> On Wed, Dec 8, 2021 at 4:37 AM Eduard Tudenhoefner <edu...@dremio.com>
> wrote:
>
>> Hello everyone,
>>
>> I was curious what people think about introducing/using the Immutables
>> <https://immutables.github.io/> library in Iceberg. The Immutables
>> library (Apache License 2.0) works via annotation processing, where an
>> abstract value class is annotated with *@Value.Immutable* to generate an 
>> *immutable
>> implementation class*.
>>
>> The main motivation for the Immutables library is outlined below:
>>
>>    - Use true immutable objects that are type-safe, thread-safe,
>>    null-safe
>>    - Get builder classes for free
>>    - reduce amount of boiler-plate code
>>    - Get JSON serialization/deserialization for free (this will also be
>>    helpful for the REST-based Catalog
>>    <https://github.com/apache/iceberg/pull/3424>) since Immutable
>>    objects are serialization ready (including JSON and its binary forms)
>>    - Supports lazy, derived and optional attributes
>>    - Immutable objects are constructed once, in a consistent state, and
>>    can be safely shared
>>       - Will fail if mandatory attributes are missing
>>       - Cannot be sneakily modified when passed to other code
>>    - Immutable objects are naturally thread-safe and can therefore be
>>    safely shared among threads
>>       - No excessive copying
>>       - No excessive synchronization
>>    - Object definitions are pleasant to write and read
>>       - No boilerplate setter and getters
>>       - No IDE-generated hashCode, equals and toString methods required.
>>       However, custom implementation for these methods can still be written
>>
>>
>> I did a small PoC in #3688 <https://github.com/apache/iceberg/pull/3688> 
>> (TableIdentifier/Namespace)
>> and #3580 <https://github.com/apache/iceberg/pull/3580> (TableMetadata)
>> but notice that both PRs still don't show the full potential of Immutables
>> as they only convert objects to being Immutable + using the generated
>> builders + doing some fairly easy validation checks.
>> The real benefit gets more obvious when we can get rid of hand-written
>> JSON serializers/deserializers.
>>
>>
>> Eduard
>>
>
>
> --
> Ryan Blue
> Tabular
>

Reply via email to