[DISCUSS] StreamingFileSink: parallelizing active buckets checkpointing?

Paul Bernier Fri, 31 Jul 2020 10:58:43 -0700

Hi all,

I was trying to use S3 StreamingFileSink with a high number of active buckets 
(>1000). I found that checkpointing duration will grow linearly with the number 
of active buckets, which makes achieving high number of active buckets 
difficult. One reason for that is that each active buckets are snapshotted 
sequentially in a 
loop<https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L245>.
 Given that operation involves waiting for some data to finish being uploaded 
to S3 that can become quite a long wait.


My question is: could this loop be safely multi-threaded?
Each Bucket seems independent (they do share the bucketWriter though). I have 
also done some basic prototyping and validation and it looks ok. So I wondering 
if I am overlooking anything and if this approach is viable?

Note: the same approach would also need to be applied to the 
onSuccessfulCompletionOfCheckpoint step with this while loop committing files 
to 
S3<https://github.com/apache/flink/blob/0a1b0c615c4968afd29f09a3494bbf137a223609/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/functions/sink/filesystem/Buckets.java#L208>.

Thank you.

Paul

[DISCUSS] StreamingFileSink: parallelizing active buckets checkpointing?

Reply via email to