Hi, We had a proposal to add a streaming shuffling stage in the Flink Iceberg sink to to improve data clustering and tame the small files problem [1].
Here are a couple of common use cases. * Event time partitioned table where we can get small files problem due to skewed and long-tail distribution on event time hours. * Improve data clustering on non-partitioned columns (e.g. device_id) where table format can leverage min-max value range for effective file pruning. The main idea is to calculate (skewed) traffic distribution statistics and shuffle records based on the computed statistics. This can achieve good data clustering on the writer subtasks while largely avoiding small files and maintaining relatively balanced traffic volume across writer subtasks. We finished a PoC on event time partitioned tables and saw 20x reduction on number of files. In another thread, there is a question if it makes sense to add this clustering shuffle feature to Flink DataStream [2], as it can potentially be useful for other sinks (like files, Apache Hudi, Delta Lake). Hence we would like to gauge the community's initial interests first before writing up a large FLIP. Thanks, Steven [1] https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo [2] https://lists.apache.org/list?dev@flink.apache.org:lte=1M:%22[DISCUSS]%20FLIP-264%20Extract%20BaseCoordinatorContext%22