> I don't quite see why `StructTransformation` would preserve nesting.

For sorting purposes, there is no need to preserve nesting.
> make it simpler by just using a SortKey

`SortKey` sounds good to me.

> How would you run a transformation and get back the original row?

There is no way to get back the original row.

> How would nested fields work?

Let's look at the following struct with a bucketing transformation on field 3
(the nested string field):

    Schema schema =
        new Schema(
            Lists.newArrayList(
                optional(1, "uuid", Types.UUIDType.get()),
                required(
                    2,
                    "struct",
                    Types.StructType.of(
                        required(3, "struct_str", Types.StringType.get())))));

A transformed schema would just change the type of field 3 from string to int.
This is the weird part and maybe doesn't make a whole lot of sense, but the
`Comparators` would still work with this model.

    Schema schema =
        new Schema(
            Lists.newArrayList(
                optional(1, "uuid", Types.UUIDType.get()),
                required(
                    2,
                    "struct",
                    Types.StructType.of(
                        required(3, "struct_str", Types.IntegerType.get())))));

On Mon, Jun 5, 2023 at 4:15 PM Ryan Blue <b...@tabular.io> wrote:

> I don't quite see why `StructTransformation` would preserve nesting.
> Aren't you basically running something that would create a new row from an
> existing one, optionally transforming the values at the same time? How
> would you run a transformation and get back the original row? How would
> nested fields work?
>
> I think you might want to make it simpler by just using a SortKey, like
> you mentioned.
>
> On Sat, Jun 3, 2023 at 8:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Ryan, thanks a lot for the feedback. Will use `StructType` when
>> applicable.
>>
>> `PartitionKey` is a combination of `StructProjection` and
>> `StructTransformation` with a flattened array of partition tuples. This
>> pattern of flattened arrays can also work for the SortOrder purpose. But it
>> is not the `StructTransformation` that I had in mind earlier, where the
>> original structure (like nesting) was maintained and only primitive types
>> and values were transformed.
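To make the "field 3 changes from string to int" point concrete, here is a hedged, self-contained sketch of a bucket-style transform. The class and method names (`BucketSketch`, `bucket`) are hypothetical, and a plain `hashCode`-based bucket stands in for the real Iceberg bucket transform (which hashes with Murmur3); the only point illustrated is that the transform's result type is an int, so comparison happens on ints.

```java
import java.util.function.Function;

// Hypothetical sketch, not the Iceberg Transform API: a bucket "transform"
// that maps a String to an Integer bucket number, which is why the
// transformed schema swaps StringType for IntegerType on field 3.
public class BucketSketch {
    // Plain hashCode instead of the real Murmur3 hash; floorMod keeps the
    // bucket non-negative even for negative hash codes.
    static Function<String, Integer> bucket(int numBuckets) {
        return s -> Math.floorMod(s.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        Function<String, Integer> b = bucket(16);
        // After the transform, struct_str values compare as ints.
        System.out.println(Integer.compare(b.apply("alpha"), b.apply("beta")));
    }
}
```

The same input always lands in the same bucket, so the induced ordering is stable even though it is unrelated to the natural string order.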
>>
>> If we go with the `PartitionKey` pattern, maybe we can call it `SortKey`.
>>
>>     public class SortKey implements StructLike {
>>       public SortKey(Schema schema, SortOrder sortOrder) {}
>>     }
>>
>> Originally, I was thinking about keeping `StructProjection` and
>> `StructTransformation` separate. For SortOrder comparison, we can chain
>> the two together: structTransformation.wrap(structProjection.wrap(...)).
>>
>> Any preference between the two choices? It probably boils down to whether
>> `StructTransformation` can be useful as a standalone class.
>>
>> On Fri, Jun 2, 2023 at 4:04 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> This all sounds pretty reasonable to me, although I'd use `StructType`
>>> rather than `Schema` in most places so this is more reusable. I definitely
>>> agree about reusing the existing tooling for `StructLike` rather than
>>> re-implementing. I'd also recommend using sort order so you can use
>>> transforms. Otherwise you'll just have to add it later.
>>>
>>> Also, check out how `PartitionKey` works, because I think that's
>>> basically the same thing as `StructTransformation`, just with a different
>>> name.
>>>
>>> On Thu, Jun 1, 2023 at 3:31 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Good point.
>>>> Stick to the conventions, then.
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote (on Wed, May 31, 2023 at 17:14):
>>>>
>>>>> Peter,
>>>>>
>>>>> I also thought about that. I didn't go with
>>>>> `StructTransformation.schema()` because I was hoping to stick with the
>>>>> `StructLike` interface, which doesn't expose `schema()`. This mimics the
>>>>> behavior of `StructProjection`, which doesn't expose `schema()` either;
>>>>> the projected schema can be extracted via
>>>>> `TypeUtil.project(Schema schema, Set<Integer> fieldIds)`.
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> On Wed, May 31, 2023 at 1:18 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> > 4.
>>>>>> To represent the transformed struct, we need a transformed schema. I am
>>>>>> thinking about adding a transform method to TypeUtil. It will return a
>>>>>> transformed schema with field types updated to the result types of the
>>>>>> transforms. This can look a bit weird with field types changed.
>>>>>> >
>>>>>> > public static Schema transform(Schema schema, Map<Integer, Transform<?, ?>> idToTransforms)
>>>>>>
>>>>>> Wouldn't it make sense to get the Schema from the `StructTransformation`
>>>>>> object instead, like `StructTransformation.schema()`?
>>>>>>
>>>>>> Steven Wu <stevenz...@gmail.com> wrote (on Wed, May 31, 2023 at 7:19):
>>>>>>
>>>>>>> We are implementing a range partitioner for Flink sink shuffling [1].
>>>>>>> One key piece is a RowDataComparator for Flink RowData. I would love to
>>>>>>> get some feedback on a few decisions.
>>>>>>>
>>>>>>> 1. Comparators for the Flink `RowData` type. Flink already has the
>>>>>>> `RowDataWrapper` class that can wrap a `RowData` as a `StructLike`. With
>>>>>>> `StructLike`, the Iceberg `Comparators` can be used to compare two
>>>>>>> structs, so we don't need to implement `RowDataComparators` that would
>>>>>>> look very similar to the struct `Comparators`. This is also related to
>>>>>>> the transformation decision below: we don't need to re-implement all the
>>>>>>> transform functions with Flink data types.
>>>>>>>
>>>>>>> 2. Use SortOrder or just natural orders (with nulls first). SortOrder
>>>>>>> supports transform functions (like bucket, hours, truncate). The
>>>>>>> implementation will be a lot simpler if we only need to implement
>>>>>>> natural order without transformations from SortOrder, but I do think the
>>>>>>> transformations (like days, bucket) in SortOrder are quite useful.
>>>>>>>
>>>>>>> In addition to the current transforms, we plan to add a `relative_hour`
>>>>>>> transform for event-time-partitioned tables.
>>>>>>> Flink range shuffle calculates traffic statistics across keys (like the
>>>>>>> number of observed rows per event hour). Ideally, the traffic
>>>>>>> distributions should be relatively stable. Hence a relative hour (hour 0
>>>>>>> meaning the current hour) can produce stable statistics for traffic
>>>>>>> weight across the relative event hours.
>>>>>>>
>>>>>>> 3. I am thinking about adding a `StructTransformation` class in the
>>>>>>> iceberg-api module. It can be implemented similarly to
>>>>>>> `StructProjection`, where transform functions are applied lazily during
>>>>>>> get.
>>>>>>>
>>>>>>> public static StructTransformation create(Schema schema, Map<Integer, Transform<?, ?>> idToTransforms)
>>>>>>>
>>>>>>> 4. To represent the transformed struct, we need a transformed schema. I
>>>>>>> am thinking about adding a transform method to TypeUtil. It will return
>>>>>>> a transformed schema with field types updated to the result types of the
>>>>>>> transforms. This can look a bit weird with field types changed.
>>>>>>>
>>>>>>> public static Schema transform(Schema schema, Map<Integer, Transform<?, ?>> idToTransforms)
>>>>>>>
>>>>>>> =========================
>>>>>>> This is how everything is put together for RowDataComparator.
>>>>>>>
>>>>>>> Schema projected = TypeUtil.select(schema, sortFieldIds); // sortFieldIds set is calculated from the SortOrder
>>>>>>> Map<Integer, Transform<?, ?>> idToTransforms = ...; // calculated from the SortOrder
>>>>>>> Schema sortSchema = TypeUtil.transform(projected, idToTransforms);
>>>>>>>
>>>>>>> StructLike leftSortKey =
>>>>>>>     structTransformation.wrap(structProjection.wrap(rowDataWrapper.wrap(leftRowData)));
>>>>>>> StructLike rightSortKey =
>>>>>>>     structTransformation.wrap(structProjection.wrap(rowDataWrapper.wrap(rightRowData)));
>>>>>>>
>>>>>>> Comparators.forType(sortSchema.asStruct()).compare(leftSortKey, rightSortKey);
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Steven
>>>>>>>
>>>>>>> [1] https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>
>
> --
> Ryan Blue
> Tabular
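The lazy-transform idea from point 3 of the thread (apply transforms inside `get`, wrapping rather than copying, as `StructProjection` does) can be sketched in a self-contained form. This is only an illustration under assumptions: `StructLike` below is a simplified stand-in for `org.apache.iceberg.StructLike`, and the hypothetical `transform` helper keys transforms by field position rather than by field id as the proposal does.

```java
import java.util.Map;
import java.util.function.Function;

// Minimal sketch of lazy transformation: wrap a row and run a per-position
// transform inside get(). StructLike here is a simplified stand-in, not the
// real org.apache.iceberg.StructLike interface.
public class StructTransformationSketch {

    interface StructLike {
        <T> T get(int pos, Class<T> javaClass);
        int size();
    }

    // Simple array-backed row used as the wrapped source.
    static class Row implements StructLike {
        private final Object[] values;
        Row(Object... values) { this.values = values; }
        public <T> T get(int pos, Class<T> javaClass) { return javaClass.cast(values[pos]); }
        public int size() { return values.length; }
    }

    // Hypothetical helper: the proposal keys transforms by field id and
    // resolves positions from the schema; keying by position keeps this short.
    static StructLike transform(StructLike row, Map<Integer, Function<Object, Object>> posToTransform) {
        return new StructLike() {
            public <T> T get(int pos, Class<T> javaClass) {
                Object value = row.get(pos, Object.class);
                Function<Object, Object> fn = posToTransform.get(pos);
                // Lazy: the transform runs on access, not when the wrapper is built.
                return javaClass.cast(fn == null ? value : fn.apply(value));
            }
            public int size() { return row.size(); }
        };
    }

    public static void main(String[] args) {
        StructLike row = new Row("2023-06-05T16:15", "event-a");
        // Truncate an ISO timestamp string to its date, as a stand-in transform.
        Map<Integer, Function<Object, Object>> truncateToDay =
                Map.of(0, v -> ((String) v).substring(0, 10));
        StructLike key = transform(row, truncateToDay);
        System.out.println(key.get(0, String.class)); // "2023-06-05"
        System.out.println(key.get(1, String.class)); // untouched position: "event-a"
    }
}
```

Because the wrapper only intercepts `get`, the same object can be rewrapped per row, which is what makes chaining it with a projection wrapper (and a `RowDataWrapper`) cheap in the comparator path.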