Ryan, thanks a lot for reviewing the proposal thoroughly.

Originally, I was only thinking about the use cases of low-cardinality
shuffling columns. You and Russel raised concerns on how this works with
high cardinality columns. I agree that we can take the range sketches
approach to make it work with high-cardinality use cases too.

I have one concern regarding the sketches approach. Batch execution only
computes the sketches once and uses it once. In long-running
streaming execution, sketches are computed and refreshed periodically. The
sketches may not be stable and change frequently. Unstable sketches can
lead to constantly changing shuffling.

On a related but separate note, range shuffling typically works well with
sorting, as sorted data can have better compression. Do we see the need to
support local sorting in Flink Iceberg writer? it is probably a separate
discussion.



On Mon, Oct 25, 2021 at 5:05 PM Ryan Blue <b...@tabular.io> wrote:

> +1 to this proposal. It's well thought out and easy to read.
>
> The only issue with it is that I think we can take a different approach to
> range sketches (like what Spark does) to avoid a huge amount of state or
> limitations on the sort key cardinality. Otherwise, it all looks good to me.
>
> Thanks for the awesome proposal, Steven!
>
> Ryan
>
> On Thu, Oct 14, 2021 at 8:56 AM Steven Wu <stevenz...@gmail.com> wrote:
>
>> Hi,
>>
>> I wrote a design doc for bin packing and range distribution shuffling
>> support in Flink Iceberg sink for streaming ingestion. Would appreciate
>> your feedback.
>>
>>
>> https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/
>>
>> Thanks,
>> Steven
>>
>
>
> --
> Ryan Blue
> Tabular
>

Reply via email to