Thanks for creating this FLIP, Fabian.

From your description I would be in favour of option 2 for the following
reasons: Assuming that option 2 solves all our current problems, it seems
like the least invasive change and smallest in scope. Your main concern is
that it might not cover future use cases. Do you have some specific use
cases in mind? I think it is ok to extend the existing interfaces in order
to cover new requirements once we learn about them. The important bit is
that we don't implement a solution that we already know won't meet all
requirements at the time of implementation. What I am missing a bit from
the description is how option 2 will behave with respect to checkpoints
and the batch execution mode.

Option 1 will require the generalization of the operator coordinator
framework to participate in the checkpointing at an arbitrary position in
the topology. Moreover, it seems as if this option exploits the JobMaster
process to run some user code that could also be done in a parallelism 1
operator (so option 2 should be able to solve this use case).
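
To make the parallelism 1 idea a bit more concrete, here is a rough sketch of
what such an aggregator could do. It is only an illustration: the names
FileCommittable, CommittableAggregator and target_bytes are made up for this
example and are not part of FLIP-191 or any Flink API, and the sketch ignores
checkpointing entirely.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FileCommittable:
    """Hypothetical stand-in for a pending file reported by a sink writer."""
    path: str
    size_bytes: int


@dataclass
class CommittableAggregator:
    """Hypothetical single-instance (parallelism 1) aggregator for small files."""
    target_bytes: int
    _pending: List[FileCommittable] = field(default_factory=list)
    _pending_bytes: int = 0

    def add(self, committable: FileCommittable) -> List[List[FileCommittable]]:
        """Buffer one committable and return any group that is ready to compact."""
        self._pending.append(committable)
        self._pending_bytes += committable.size_bytes
        if self._pending_bytes >= self.target_bytes:
            return [self._drain()]
        return []

    def flush(self) -> List[List[FileCommittable]]:
        """Emit whatever is still buffered, e.g. at end of input in batch mode."""
        return [self._drain()] if self._pending else []

    def _drain(self) -> List[FileCommittable]:
        # Hand out the buffered files and reset the buffer in one step.
        group, self._pending, self._pending_bytes = self._pending, [], 0
        return group


# Usage: four small files, compaction target of 100 bytes.
aggregator = CommittableAggregator(target_bytes=100)
groups = []
for i, size in enumerate([40, 30, 50, 20]):
    groups += aggregator.add(FileCommittable(f"part-{i}", size))
groups += aggregator.flush()
print([[c.path for c in g] for g in groups])
# prints [['part-0', 'part-1', 'part-2'], ['part-3']]
```

Because all committables pass through a single instance, the grouping decision
needs no coordination outside the topology.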

Option 3 sounds like the most generic approach. But with a lot of power
also comes some responsibility, and I can see that having to provide an
arbitrary topology that works with both streaming and batch could become
quite a challenge for sink developers. I think it would be easier for sink
developers if more of the dimensions were fixed, if possible.

I've left some more comments on the wiki page. PTAL.

Cheers,
Till

On Tue, Nov 2, 2021 at 5:44 PM Fabian Paul <fabianp...@ververica.com> wrote:

> Hi all,
>
> More and more data lake sinks rely on columnar formats, which benefit from
> a few larger files rather than many small files (read amplification).
> Our current FileSink cannot guarantee a certain file size when writing to
> an external filesystem, which I call the small file compaction
> problem. Unfortunately, there is no good way with the current unified Sink
> operator topology to support this use case.
>
> I would like to propose extending the unified Sink interface, which we
> proposed in FLIP-143, to resolve the small file compaction problem.
> Therefore I have created FLIP-191 [1] to outline three different options
> for how the problem could be addressed.
>
> 1. Global Sink Coordinator
> 2. Committable Aggregator Operator
> 3. Custom sink topology
>
> Further information about the alternatives can be found in the document
> and I would appreciate your feedback to decide on which way to go to
> finally resolve this problem.
>
> Best,
> Fabian
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-191%3A+Extend+unified+Sink+interface+to+support+small+file+compaction
>
>
