[DISCUSS] streaming shuffle to improve data clustering and tame small files problem

Steven Wu Mon, 30 Jan 2023 16:46:13 -0800

Hi,

We had a proposal to add a streaming shuffling stage in the Flink Iceberg
sink to to improve data clustering and tame the small files problem [1].


Here are a couple of common use cases.
* Event time partitioned table where we can get small files problem due to
skewed and long-tail distribution on event time hours.
* Improve data clustering on non-partitioned columns (e.g. device_id) where
table format can leverage min-max value range for effective file pruning.

The main idea is to calculate (skewed) traffic distribution statistics and
shuffle records based on the computed statistics. This can achieve good
data clustering on the writer subtasks while largely avoiding small files
and maintaining relatively balanced traffic volume across writer subtasks.
We finished a PoC on event time partitioned tables and saw 20x reduction on
number of files.

In another thread, there is a question if it makes sense to add this
clustering shuffle feature to Flink DataStream [2], as it can potentially
be useful for other sinks (like files, Apache Hudi, Delta Lake). Hence we
would like to gauge the community's initial interests first before writing
up a large FLIP.

Thanks,
Steven

[1]
https://docs.google.com/document/d/13N8cMqPi-ZPSKbkXGOBMPOzbv2Fua59j8bIjjtxLWqo
[2]
https://lists.apache.org/list?dev@flink.apache.org:lte=1M:%22[DISCUSS]%20FLIP-264%20Extract%20BaseCoordinatorContext%22

[DISCUSS] streaming shuffle to improve data clustering and tame small files problem

Reply via email to