Thanks for creating this FLIP, Fabian.

From your description I would be in favour of option 2 for the following reasons: Assuming that option 2 solves all our current problems, it seems like the least invasive change and the smallest in scope. Your main concern is that it might not cover future use cases. Do you have some specific use cases in mind? I think it is ok to extend the existing interfaces in order to cover new requirements once we learn about them. The important bit is that we don't implement a solution that we already know won't solve all requirements at the time of implementation.

What I am missing a bit from the description is how option 2 will behave wrt checkpoints and the batch execution mode.

Option 1 will require generalizing the operator coordinator framework so that it can participate in checkpointing at an arbitrary position in the topology. Moreover, it seems as if this option exploits the JobMaster process to run some user code that could also run in a parallelism 1 operator (so option 2 should be able to solve this use case).

Option 3 sounds like the most generic approach. But with a lot of power also comes some responsibility, and I could see that being able to insert an arbitrary topology that has to work with both streaming and batch can become quite a challenge for sink developers. I think it would be easier for sink developers if more dimensions were fixed, if possible.

I've left some more comments on the wiki page. PTAL.

Cheers,
Till

On Tue, Nov 2, 2021 at 5:44 PM Fabian Paul <fabianp...@ververica.com> wrote:

> Hi all,
>
> More and more data lake sinks rely on columnar formats which benefit from
> few larger files than a lot of small files (read amplification).
> Our current FileSink cannot ensure a certain size when writing to an
> external filesystem which I call the small file compaction
> problem. Unfortunately, there is no good way with the current unified Sink
> operator topology to support this use case.
>
> I would like to propose to extend the unified Sink interface which we
> proposed in FLIP-143 to resolve the small file compaction problem.
> Therefore I have created FLIP-191 [1] to outline three different options
> how the problem could be addressed.
>
> 1. Global Sink Coordinator
> 2. Committable Aggregator Operator
> 3. Custom sink topology
>
> Further information about the alternatives can be found in the document
> and I would appreciate your feedback to decide on which way to go to
> finally resolve this problem.
>
> Best,
> Fabian
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction