Re: Core changes for Flink Dynamic Iceberg Sink

Maximilian Michels Wed, 21 May 2025 02:35:17 -0700

Thanks Ryan! I totally missed those static method constructors (maybe
because they are in an entirely different class). In that case, we
don't need change (1) and (2).


For SchemaUpdate, I agree we could implement the interface. The
drawback is that it won't verify test results with the actual
implementation, runtime behavior may be completely different.
Obviously, we can still assert certain behavior, but we lose some
confidence.

Cheers,
Max

On Tue, May 20, 2025 at 11:52 PM Ryan Blue <[email protected]> wrote:
>
> Max,
>
> Can you use the factory methods in Expressions rather than changing 
> visibility?
>
> Also, I don’t think that making SchemaUpdate public is a good idea. It has a 
> public interface that you can use and I don’t agree with exposing internal 
> implementation classes just to write tests. I would expect test cases to be 
> specific, much like how SchemaUpdate is itself tested.
>
> Ryan
>
>
> On Mon, May 19, 2025 at 9:41 AM Maximilian Michels <[email protected]> wrote:
>>
>> Hi,
>>
>> The Flink Dynamic Iceberg Sink requires a few iceberg-core class
>> visibility changes, which I'd like to get your feedback on.
>>
>> Here is the diff:
>> https://github.com/apache/iceberg/pull/13032/commits/1804b0ac4ff97c3c943463725e91a1e24b0f8c44
>>
>> There are three visibility changes:
>>
>> Change 1: Make NamedReference constructor public
>> Change 2: Make UnboundTransform constructor public
>> Change 3: Make SchemaUpdate class and constructor public
>>
>> Change (1) and (2) both fall under the same feature, which is
>> rebinding PartitionSpec transforms.
>>
>> Background: The Dynamic Iceberg Sink can dynamically evolve Schema and
>> PartitionSpec of a table. The user provides schema / spec in the input
>> record. However, to ensure the changes are safe, we don't directly
>> take the user schema / spec, but we rebind it to one of the existing
>> schemas / specs. We do that by name-matching the fields in the
>> user-provided schema with the fields in the table schemas. This allows
>> us to directly use or evolve an existing schema and to preserve the
>> internal field ids. Name-based schema evolution is also what the
>> Iceberg Schema API offers.
>>
>> Similarly, user-provided PartitionSpecs need to be rewritten for their
>> transforms to get bound to the correct source id, which is one of the
>> field ids of the table schema. The UpdatePartitionSpec API works on
>> Terms which is what we generate with this code and the class
>> visibility changes:
>> https://github.com/apache/iceberg/blob/61fedb11399ee344f9621362521e8a5071e91c05/flink/v2.0/flink/src/main/java/org/apache/iceberg/flink/sink/dynamic/PartitionSpecEvolution.java#L124
>> In order to rebind the provided PartitionSpec to the id of the
>> name-matching field in the table schema, we need the constructor of
>> NamedReference and UnboundTransform to be public.
>>
>> Change (3) is used to test the schema update code. We want to test the
>> schema evolution code, which uses SchemaUpdate API to make the
>> changes. Our tests validate that the applied changes yield the
>> expected new schema. Example test code:
>> https://github.com/apache/iceberg/blob/61fedb11399ee344f9621362521e8a5071e91c05/flink/v2.0/flink/src/test/java/org/apache/iceberg/flink/sink/dynamic/TestEvolveSchemaVisitor.java#L111
>>
>> I hope that makes sense. I'm curious to hear what you think. Any concerns?
>>
>> Cheers,
>> Max

Re: Core changes for Flink Dynamic Iceberg Sink

Reply via email to