Ryan, thanks a lot for reviewing the proposal thoroughly. Originally, I was only thinking about the use cases of low-cardinality shuffling columns. You and Russel raised concerns on how this works with high cardinality columns. I agree that we can take the range sketches approach to make it work with high-cardinality use cases too.
I have one concern regarding the sketches approach. Batch execution only computes the sketches once and uses it once. In long-running streaming execution, sketches are computed and refreshed periodically. The sketches may not be stable and change frequently. Unstable sketches can lead to constantly changing shuffling. On a related but separate note, range shuffling typically works well with sorting, as sorted data can have better compression. Do we see the need to support local sorting in Flink Iceberg writer? it is probably a separate discussion. On Mon, Oct 25, 2021 at 5:05 PM Ryan Blue <b...@tabular.io> wrote: > +1 to this proposal. It's well thought out and easy to read. > > The only issue with it is that I think we can take a different approach to > range sketches (like what Spark does) to avoid a huge amount of state or > limitations on the sort key cardinality. Otherwise, it all looks good to me. > > Thanks for the awesome proposal, Steven! > > Ryan > > On Thu, Oct 14, 2021 at 8:56 AM Steven Wu <stevenz...@gmail.com> wrote: > >> Hi, >> >> I wrote a design doc for bin packing and range distribution shuffling >> support in Flink Iceberg sink for streaming ingestion. Would appreciate >> your feedback. >> >> >> https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/ >> >> Thanks, >> Steven >> > > > -- > Ryan Blue > Tabular >