No, I don't think that local sorting in Flink is a good idea, at least for streaming. I'm not sure what the plan is on the batch side, but it probably only makes sense for batch.
On Mon, Oct 25, 2021 at 7:42 PM Steven Wu <stevenz...@gmail.com> wrote: > Ryan, thanks a lot for reviewing the proposal thoroughly. > > Originally, I was only thinking about the use cases of low-cardinality > shuffling columns. You and Russel raised concerns on how this works with > high cardinality columns. I agree that we can take the range sketches > approach to make it work with high-cardinality use cases too. > > I have one concern regarding the sketches approach. Batch execution only > computes the sketches once and uses it once. In long-running > streaming execution, sketches are computed and refreshed periodically. The > sketches may not be stable and change frequently. Unstable sketches can > lead to constantly changing shuffling. > > On a related but separate note, range shuffling typically works well with > sorting, as sorted data can have better compression. Do we see the need to > support local sorting in Flink Iceberg writer? it is probably a separate > discussion. > > > > On Mon, Oct 25, 2021 at 5:05 PM Ryan Blue <b...@tabular.io> wrote: > >> +1 to this proposal. It's well thought out and easy to read. >> >> The only issue with it is that I think we can take a different approach >> to range sketches (like what Spark does) to avoid a huge amount of state or >> limitations on the sort key cardinality. Otherwise, it all looks good to me. >> >> Thanks for the awesome proposal, Steven! >> >> Ryan >> >> On Thu, Oct 14, 2021 at 8:56 AM Steven Wu <stevenz...@gmail.com> wrote: >> >>> Hi, >>> >>> I wrote a design doc for bin packing and range distribution shuffling >>> support in Flink Iceberg sink for streaming ingestion. Would appreciate >>> your feedback. >>> >>> >>> https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo/ >>> >>> Thanks, >>> Steven >>> >> >> >> -- >> Ryan Blue >> Tabular >> > -- Ryan Blue Tabular