Hi all,

More and more data lake sinks rely on columnar formats, which benefit from a few larger files rather than many small ones (many small files lead to read amplification). Our current FileSink cannot guarantee a certain file size when writing to an external filesystem; I call this the small file compaction problem. Unfortunately, the current unified Sink operator topology offers no good way to support this use case.

I would like to propose extending the unified Sink interface, which we introduced in FLIP-143, to resolve the small file compaction problem. To that end, I have created FLIP-191 [1], which outlines three options for addressing the problem:

1. Global Sink Coordinator
2. Committable Aggregator Operator
3. Custom sink topology

Further information about the alternatives can be found in the document. I would appreciate your feedback to help decide which way to go so that we can finally resolve this problem.

Best,
Fabian

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction