Hi Dongwon, There are currently no metrics for the async work-queue size (you should be able to see the queue stats with debug logs enabled though [1]). As far as I can tell from looking at the code, the async operator is able to checkpoint even if the work-queue is exhausted.
Arvid can you please validate the above? (the checkpoints not being blocked by the work queue part) [1] https://github.com/apache/flink/blob/release-1.14.0/flink-streaming-java/src/main/java/org/apache/flink/streaming/api/operators/async/queue/OrderedStreamElementQueue.java#L109 Best, D. On Sun, Nov 7, 2021 at 10:41 AM Dongwon Kim <eastcirc...@gmail.com> wrote: > Hi community, > > While using Flink's async i/o for interacting with an external system, I > got the following exception: > > 2021-11-06 10:38:35,270 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 54 (type=CHECKPOINT) @ 1636162715262 for job > f168a44ea33198cd71783824d49f9554. > 2021-11-06 10:38:47,031 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Completed > checkpoint 54 for job f168a44ea33198cd71783824d49f9554 (11930992707 bytes, > checkpointDuration=11722 ms, finalizationTime=47 ms). > 2021-11-06 10:58:35,270 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Triggering > checkpoint 55 (type=CHECKPOINT) @ 1636163915262 for job > f168a44ea33198cd71783824d49f9554. > 2021-11-06 11:08:35,271 INFO > org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Checkpoint > 55 of job f168a44ea33198cd71783824d49f9554 expired before completing. > 2021-11-06 11:08:35,287 INFO org.apache.flink.runtime.jobmaster.JobMaster > [] - Trying to recover from a global failure. > > > - FYI, I'm using 1.14.0 and enabled unaligned checkpointing and buffer > debloating > - the 55th ckpt failed to complete within 10 mins (which is the value of > execution.checkpointing.timeout) > - the below graph shows that backpressure skyrocketed around the time the > 55th ckpt began > [image: image.png] > > What I suspect is the capacity of the asynchronous operation because > limiting the value can cause back-pressure once the capacity is exhausted > [1]. > > Although I could increase the value, I want to monitor the current > in-flight async i/o requests like the above back-pressure graph on Grafana. > [2] does not introduce any system metric specific to async i/o. > > Best, > > Dongwon > > [1] > https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api > [2] > https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#system-metrics > > >