Hi community,

While using Flink's async I/O to interact with an external system, I
ran into the following checkpoint failure:

2021-11-06 10:38:35,270 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Triggering checkpoint 54 (type=CHECKPOINT) @ 1636162715262 for job
f168a44ea33198cd71783824d49f9554.
2021-11-06 10:38:47,031 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Completed checkpoint 54 for job f168a44ea33198cd71783824d49f9554
(11930992707 bytes, checkpointDuration=11722 ms, finalizationTime=47
ms).
2021-11-06 10:58:35,270 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Triggering checkpoint 55 (type=CHECKPOINT) @ 1636163915262 for job
f168a44ea33198cd71783824d49f9554.
2021-11-06 11:08:35,271 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator    [] -
Checkpoint 55 of job f168a44ea33198cd71783824d49f9554 expired before
completing.
2021-11-06 11:08:35,287 INFO
org.apache.flink.runtime.jobmaster.JobMaster                 [] -
Trying to recover from a global failure.


- FYI, I'm using Flink 1.14.0 with unaligned checkpoints and buffer
debloating enabled
- checkpoint 55 failed to complete within 10 minutes (the value of
execution.checkpointing.timeout)
- the graph below shows that back-pressure skyrocketed around the time
checkpoint 55 began
[image: image.png]

I suspect the capacity of the async operator: it limits the number of
concurrent requests, and the operator triggers back-pressure once that
capacity is exhausted [1].
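
To make the suspicion concrete, here is a toy model of the capacity
limit in plain Java. It is not Flink code: CapacityModel, tryAdmit and
complete are names I made up for illustration. In a real job the
capacity is the last argument passed to AsyncDataStream.unorderedWait
(or orderedWait).

```java
import java.util.concurrent.Semaphore;

/** Toy model of a capacity-limited async operator (hypothetical, not Flink code). */
public class CapacityModel {
    // One permit per request slot; the permit count plays the role of "capacity".
    private final Semaphore slots;

    public CapacityModel(int capacity) {
        this.slots = new Semaphore(capacity);
    }

    /** Tries to admit one request; false means the caller would be back-pressured. */
    public boolean tryAdmit() {
        return slots.tryAcquire();
    }

    /** Called when a request completes, which frees one slot. */
    public void complete() {
        slots.release();
    }

    public static void main(String[] args) {
        CapacityModel op = new CapacityModel(2);
        System.out.println(op.tryAdmit()); // true: slot 1 taken
        System.out.println(op.tryAdmit()); // true: slot 2 taken
        System.out.println(op.tryAdmit()); // false: capacity exhausted
        op.complete();                     // one request finishes
        System.out.println(op.tryAdmit()); // true: a slot is free again
    }
}
```

If the external system slows down, slots are freed more slowly, so the
operator would spend more time in the "capacity exhausted" state.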

Although I could simply increase the capacity, I would first like to
monitor the number of in-flight async I/O requests on Grafana, like the
back-pressure graph above. However, [2] does not list any system metric
specific to async I/O. Is there a way to get such a metric?
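
One workaround I can think of is to count the outstanding requests
myself. Below is a minimal sketch in plain Java, assuming the client
returns a CompletableFuture; InFlightTracker, track and current are
hypothetical names, not part of Flink.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Supplier;

/** Hypothetical helper that counts requests issued but not yet completed. */
public class InFlightTracker {
    private final AtomicInteger inFlight = new AtomicInteger();

    /**
     * Starts the request and decrements the counter once it finishes,
     * whether it succeeds or fails.
     */
    public <T> CompletableFuture<T> track(Supplier<CompletableFuture<T>> request) {
        inFlight.incrementAndGet();
        CompletableFuture<T> future;
        try {
            future = request.get();
        } catch (RuntimeException e) {
            // The request never started, so undo the increment.
            inFlight.decrementAndGet();
            throw e;
        }
        return future.whenComplete((result, error) -> inFlight.decrementAndGet());
    }

    /** Number of requests currently outstanding. */
    public int current() {
        return inFlight.get();
    }
}
```

Inside a RichAsyncFunction, I assume the counter could be exposed in
open() as a user-defined gauge via
getRuntimeContext().getMetricGroup().gauge("inFlightRequests",
(Gauge<Integer>) tracker::current), where "inFlightRequests" is a
made-up metric name. Note that this counts outstanding client requests,
which is not necessarily the same as the occupancy of the operator's
internal queue.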

Best,

Dongwon

[1]
https://ci.apache.org/projects/flink/flink-docs-master/docs/dev/datastream/operators/asyncio/#async-io-api
[2]
https://ci.apache.org/projects/flink/flink-docs-master/docs/ops/metrics/#system-metrics
