Hi everyone,

Unaligned Checkpoint (FLIP-76 [1]) is a major feature of Flink. It
effectively solves the problem of slow or timed-out checkpoints under
severe backpressure.

However, we found that UC (Unaligned Checkpoint) does not work well when
backpressure is severe and multiple output buffers are required to process
a single record. FLINK-14396 [2] also mentioned this issue. We therefore
propose an overdraft buffer to solve it.

I created FLINK-26762 [3] and FLIP-227 [4] to describe the overdraft buffer
mechanism in detail. Anton Kalashnikov and I have discussed it, and a few
points are still open:

   - There are already a lot of buffer-related configurations. Do we need
   to add a new configuration for the overdraft buffer?
   - Where should the overdraft buffer take its memory from?
      - If the overdraft buffer uses the memory remaining in the
      NetworkBufferPool, no new configuration needs to be added.
      - If we add a new configuration:
         - Should we set overdraft-memory-size at the TM level or the
         Task level?
         - Or should we set overdraft-buffers to indicate the number of
         memory segments that can be overdrawn?
         - What should the default value be, and how do we choose a
         sensible one?

I have implemented a POC [5] and verified it using flink-benchmarks [6].
The POC sets overdraft-buffers at the Task level, with a default value of
10. That is, each LocalBufferPool can overdraw up to 10 memory segments.
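
To make the accounting concrete, here is a minimal sketch of the idea in
Java. It is a toy model, not the POC code and not Flink's LocalBufferPool:
the class and method names are hypothetical, and it assumes that the task
only starts a new record while the pool is available, and that overdraft
segments are only used to finish a record that is already being processed.

```java
/**
 * Toy model of a buffer pool with an overdraft allowance.
 * All names are hypothetical; this is NOT Flink's LocalBufferPool.
 */
class OverdraftBufferPoolSketch {
    private final int regularBuffers; // ordinary size of the pool
    private final int maxOverdraft;   // extra segments that may be overdrawn
    private int inUse = 0;            // segments currently handed out

    OverdraftBufferPoolSketch(int regularBuffers, int maxOverdraft) {
        this.regularBuffers = regularBuffers;
        this.maxOverdraft = maxOverdraft;
    }

    /** Assumption: the task starts a new record only while this is true. */
    boolean isAvailable() {
        return inUse < regularBuffers;
    }

    /**
     * Requests one segment for the record in flight. Regular segments are
     * used first, then overdraft segments. Returns false once both are
     * exhausted, which is where the caller would have to block.
     */
    boolean requestBuffer() {
        if (inUse < regularBuffers + maxOverdraft) {
            inUse++;
            return true;
        }
        return false;
    }

    /** Returns one segment to the pool. */
    void recycleBuffer() {
        if (inUse > 0) {
            inUse--;
        }
    }

    /** Number of segments currently borrowed beyond the regular size. */
    int overdrawn() {
        return Math.max(0, inUse - regularBuffers);
    }
}
```

With a pool of 2 regular segments and an overdraft of 10, a record that
needs 12 output buffers can be written without blocking. The pool then
reports itself unavailable until enough segments are recycled, so no new
record is started while the task is overdrawn.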

Looking forward to your feedback!

Thanks,
fanrui

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-76%3A+Unaligned+Checkpoints
[2] https://issues.apache.org/jira/browse/FLINK-14396
[3] https://issues.apache.org/jira/browse/FLINK-26762
[4]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-227%3A+Support+overdraft+buffer
[5]
https://github.com/1996fanrui/flink/commit/c7559d94767de97c24ea8c540878832138c8e8fe
[6] https://github.com/apache/flink-benchmarks/pull/54
