> I don't quite see why `StructTransformation` would preserve nesting.

For sorting purposes, there is no need to preserve nesting.
> make it simpler by just using a SortKey

`SortKey` sounds good to me.

> How would you run a transformation and get back the original row?

There is no way to get back the original row.

> How would nested fields work?

Let's look at the following struct with a bucketing transformation on field 3
(the nested string field):

    Schema schema =
        new Schema(
            Lists.newArrayList(
                optional(1, "uuid", Types.UUIDType.get()),
                required(
                    2,
                    "struct",
                    Types.StructType.of(
                        required(3, "struct_str", Types.StringType.get())))));

A transformed schema would just change the type of field 3 from string to int.
This is the weird part and maybe doesn't make a whole lot of sense, but the
`Comparators` would still work with this model.

    Schema schema =
        new Schema(
            Lists.newArrayList(
                optional(1, "uuid", Types.UUIDType.get()),
                required(
                    2,
                    "struct",
                    Types.StructType.of(
                        required(3, "struct_str", Types.IntegerType.get())))));

On Mon, Jun 5, 2023 at 4:15 PM Ryan Blue <b...@tabular.io> wrote:

> I don't quite see why `StructTransformation` would preserve nesting.
> Aren't you basically running something that would create a new row from an
> existing one, optionally transforming the values at the same time? How
> would you run a transformation and get back the original row? How would
> nested fields work?
>
> I think you might want to make it simpler by just using a SortKey, like
> you mentioned.
>
> On Sat, Jun 3, 2023 at 8:03 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Ryan, thanks a lot for the feedback. Will use `StructType` when
>> applicable.
>>
>> `PartitionKey` is a combination of `StructProjection` and
>> `StructTransformation` with a flattened array of partition tuples. This
>> pattern of flattened arrays can also work for the SortOrder purpose. But it
>> is not the `StructTransformation` that I had in mind earlier, where the
>> original structure (like nesting) was maintained and only primitive types
>> and values were transformed.
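To make the "field 3 changes from string to int" point concrete, here is a hedged, self-contained sketch of a bucket-style transform. The class and method names (`BucketSketch`, `bucket`) are hypothetical, and a plain `hashCode`-based bucket stands in for the real Iceberg bucket transform (which hashes with Murmur3); the only point illustrated is that the transform's result type is an int, so comparison happens on ints.

```java
import java.util.function.Function;

// Hypothetical sketch, not the Iceberg Transform API: a bucket "transform"
// that maps a String to an Integer bucket number, which is why the
// transformed schema swaps StringType for IntegerType on field 3.
public class BucketSketch {
    // Plain hashCode instead of the real Murmur3 hash; floorMod keeps the
    // bucket non-negative even for negative hash codes.
    static Function<String, Integer> bucket(int numBuckets) {
        return s -> Math.floorMod(s.hashCode(), numBuckets);
    }

    public static void main(String[] args) {
        Function<String, Integer> b = bucket(16);
        // After the transform, struct_str values compare as ints.
        System.out.println(Integer.compare(b.apply("alpha"), b.apply("beta")));
    }
}
```

The same input always lands in the same bucket, so the induced ordering is stable even though it is unrelated to the natural string order.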
>>
>> If we go with the `PartitionKey` pattern, maybe we can call it `SortKey`.
>>
>>     public class SortKey implements StructLike {
>>       public SortKey(Schema schema, SortOrder sortOrder) {}
>>     }
>>
>> Originally, I was thinking about keeping `StructProjection` and
>> `StructTransformation` separate. For SortOrder comparison, we can chain
>> the two together: structTransformation.wrap(structProjection.wrap(...)).
>>
>> Any preference between the two choices? It probably boils down to whether
>> `StructTransformation` can be useful as a standalone class.
>>
>> On Fri, Jun 2, 2023 at 4:04 PM Ryan Blue <b...@tabular.io> wrote:
>>
>>> This all sounds pretty reasonable to me, although I'd use `StructType`
>>> rather than `Schema` in most places so this is more reusable. I definitely
>>> agree about reusing the existing tooling for `StructLike` rather than
>>> re-implementing. I'd also recommend using sort order so you can use
>>> transforms. Otherwise you'll just have to add it later.
>>>
>>> Also, check out how `PartitionKey` works, because I think that's
>>> basically the same thing as `StructTransformation`, just with a different
>>> name.
>>>
>>> On Thu, Jun 1, 2023 at 3:31 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>
>>>> Good point.
>>>> Stick to the conventions, then.
>>>>
>>>> Steven Wu <stevenz...@gmail.com> wrote (on Wed, May 31, 2023 at 17:14):
>>>>
>>>>> Peter,
>>>>>
>>>>> I also thought about that. I didn't go with
>>>>> `StructTransformation.schema()` because I was hoping to stick with the
>>>>> `StructLike` interface, which doesn't expose `schema()`. This mimics the
>>>>> behavior of `StructProjection`, which doesn't expose `schema()` either;
>>>>> the projected schema can be extracted via
>>>>> `TypeUtil.project(Schema schema, Set<Integer> fieldIds)`.
>>>>>
>>>>> Thanks,
>>>>> Steven
>>>>>
>>>>> On Wed, May 31, 2023 at 1:18 AM Péter Váry <peter.vary.apa...@gmail.com> wrote:
>>>>>
>>>>>> > 4.
>>>>>> To represent the transformed struct, we need a transformed schema. I am
>>>>>> thinking about adding a transform method to TypeUtil. It will return a
>>>>>> transformed schema with field types updated to the result types of the
>>>>>> transforms. This can look a bit weird with field types changed.
>>>>>> >
>>>>>> > public static Schema transform(Schema schema, Map<Integer, Transform<?, ?>> idToTransforms)
>>>>>>
>>>>>> Wouldn't it make sense to get the Schema from the `StructTransformation`
>>>>>> object instead, like `StructTransformation.schema()`?
>>>>>>
>>>>>> Steven Wu <stevenz...@gmail.com> wrote (on Wed, May 31, 2023 at 7:19):
>>>>>>
>>>>>>> We are implementing a range partitioner for Flink sink shuffling [1].
>>>>>>> One key piece is a RowDataComparator for Flink RowData. I would love to
>>>>>>> get some feedback on a few decisions.
>>>>>>>
>>>>>>> 1. Comparators for the Flink `RowData` type. Flink already has the
>>>>>>> `RowDataWrapper` class that can wrap a `RowData` as a `StructLike`. With
>>>>>>> `StructLike`, the Iceberg `Comparators` can be used to compare two
>>>>>>> structs, so we don't need to implement `RowDataComparators` that would
>>>>>>> look very similar to the struct `Comparators`. This is also related to
>>>>>>> the transformation decision below: we don't need to re-implement all the
>>>>>>> transform functions with Flink data types.
>>>>>>>
>>>>>>> 2. Use SortOrder or just natural orders (with nulls first). SortOrder
>>>>>>> supports transform functions (like bucket, hours, truncate). The
>>>>>>> implementation will be a lot simpler if we only need to implement
>>>>>>> natural order without transformations from SortOrder, but I do think the
>>>>>>> transformations (like days, bucket) in SortOrder are quite useful.
>>>>>>>
>>>>>>> In addition to the current transforms, we plan to add a `relative_hour`
>>>>>>> transform for event-time-partitioned tables.
>>>>>>> Flink range shuffle calculates traffic statistics across keys (like the
>>>>>>> number of observed rows per event hour). Ideally, the traffic
>>>>>>> distributions should be relatively stable. Hence a relative hour (hour 0
>>>>>>> meaning the current hour) can produce stable statistics for traffic
>>>>>>> weight across the relative event hours.
>>>>>>>
>>>>>>> 3. I am thinking about adding a `StructTransformation` class in the
>>>>>>> iceberg-api module. It can be implemented similarly to
>>>>>>> `StructProjection`, where transform functions are applied lazily during
>>>>>>> get.
>>>>>>>
>>>>>>> public static StructTransformation create(Schema schema, Map<Integer, Transform<?, ?>> idToTransforms)
>>>>>>>
>>>>>>> 4. To represent the transformed struct, we need a transformed schema. I
>>>>>>> am thinking about adding a transform method to TypeUtil. It will return
>>>>>>> a transformed schema with field types updated to the result types of the
>>>>>>> transforms. This can look a bit weird with field types changed.
>>>>>>>
>>>>>>> public static Schema transform(Schema schema, Map<Integer, Transform<?, ?>> idToTransforms)
>>>>>>>
>>>>>>> =========================
>>>>>>> This is how everything is put together for RowDataComparator.
>>>>>>>
>>>>>>> Schema projected = TypeUtil.select(schema, sortFieldIds); // sortFieldIds set is calculated from the SortOrder
>>>>>>> Map<Integer, Transform<?, ?>> idToTransforms = ...; // calculated from the SortOrder
>>>>>>> Schema sortSchema = TypeUtil.transform(projected, idToTransforms);
>>>>>>>
>>>>>>> StructLike leftSortKey =
>>>>>>>     structTransformation.wrap(structProjection.wrap(rowDataWrapper.wrap(leftRowData)));
>>>>>>> StructLike rightSortKey =
>>>>>>>     structTransformation.wrap(structProjection.wrap(rowDataWrapper.wrap(rightRowData)));
>>>>>>>
>>>>>>> Comparators.forType(sortSchema.asStruct()).compare(leftSortKey, rightSortKey);
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Steven
>>>>>>>
>>>>>>> [1] https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
>>>
>>> --
>>> Ryan Blue
>>> Tabular
>>
>
> --
> Ryan Blue
> Tabular
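The lazy-transform idea from point 3 of the thread (apply transforms inside `get`, wrapping rather than copying, as `StructProjection` does) can be sketched in a self-contained form. This is only an illustration under assumptions: `StructLike` below is a simplified stand-in for `org.apache.iceberg.StructLike`, and the hypothetical `transform` helper keys transforms by field position rather than by field id as the proposal does.

```java
import java.util.Map;
import java.util.function.Function;

// Minimal sketch of lazy transformation: wrap a row and run a per-position
// transform inside get(). StructLike here is a simplified stand-in, not the
// real org.apache.iceberg.StructLike interface.
public class StructTransformationSketch {

    interface StructLike {
        <T> T get(int pos, Class<T> javaClass);
        int size();
    }

    // Simple array-backed row used as the wrapped source.
    static class Row implements StructLike {
        private final Object[] values;
        Row(Object... values) { this.values = values; }
        public <T> T get(int pos, Class<T> javaClass) { return javaClass.cast(values[pos]); }
        public int size() { return values.length; }
    }

    // Hypothetical helper: the proposal keys transforms by field id and
    // resolves positions from the schema; keying by position keeps this short.
    static StructLike transform(StructLike row, Map<Integer, Function<Object, Object>> posToTransform) {
        return new StructLike() {
            public <T> T get(int pos, Class<T> javaClass) {
                Object value = row.get(pos, Object.class);
                Function<Object, Object> fn = posToTransform.get(pos);
                // Lazy: the transform runs on access, not when the wrapper is built.
                return javaClass.cast(fn == null ? value : fn.apply(value));
            }
            public int size() { return row.size(); }
        };
    }

    public static void main(String[] args) {
        StructLike row = new Row("2023-06-05T16:15", "event-a");
        // Truncate an ISO timestamp string to its date, as a stand-in transform.
        Map<Integer, Function<Object, Object>> truncateToDay =
                Map.of(0, v -> ((String) v).substring(0, 10));
        StructLike key = transform(row, truncateToDay);
        System.out.println(key.get(0, String.class)); // "2023-06-05"
        System.out.println(key.get(1, String.class)); // untouched position: "event-a"
    }
}
```

Because the wrapper only intercepts `get`, the same object can be rewrapped per row, which is what makes chaining it with a projection wrapper (and a `RowDataWrapper`) cheap in the comparator path.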