Hi all,

More and more data lake sinks rely on columnar formats, which benefit from a 
few large files rather than many small ones (small files cause read 
amplification). 
Our current FileSink cannot ensure a certain file size when writing to an 
external filesystem; I call this the small file compaction problem. 
Unfortunately, the current unified Sink operator topology offers no good way 
to support this use case.

I would like to propose extending the unified Sink interface, which we 
proposed in FLIP-143, to resolve the small file compaction problem.
To that end, I have created FLIP-191 [1], which outlines three options for 
addressing the problem:

1. Global Sink Coordinator
2. Committable Aggregator Operator
3. Custom sink topology

Further information about the alternatives can be found in the document. I 
would appreciate your feedback to help decide which way to go so that we can 
finally resolve this problem.

Best,
Fabian

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction
