Just to echo what Lu mentioned: is there documentation where we can
find more information on
* when barriers are injected at the sources by the checkpoint coordinator
* when each downstream task observes the first barrier of a checkpoint
* when all barriers of a checkpoint have arrived at a task
* when the snapshot starts/completes
* when the upload to the remote file system starts/completes
* when the ack is sent to the checkpoint coordinator
For now, the Flink UI only shows that a checkpoint timed out because
a task couldn't finish in time, which is too limited to debug further.
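
To make the request concrete, here is a minimal sketch of the kind of
per-stage timestamps I have in mind. The class CheckpointStageTimer and
its Stage names are hypothetical and only mirror the list above; this
is not an existing Flink API.

```java
import java.util.EnumMap;
import java.util.Map;

/** Hypothetical helper, not part of Flink: records when each checkpoint stage is reached. */
class CheckpointStageTimer {
    /** Stage names mirror the list above and are illustrative only. */
    enum Stage {
        BARRIER_INJECTED, FIRST_BARRIER_SEEN, ALL_BARRIERS_ARRIVED,
        SNAPSHOT_STARTED, SNAPSHOT_COMPLETED,
        UPLOAD_STARTED, UPLOAD_COMPLETED, ACK_SENT
    }

    private final Map<Stage, Long> timestampsMillis = new EnumMap<>(Stage.class);

    /** Records the time at which a stage was reached; the first recorded value wins. */
    void mark(Stage stage, long nowMillis) {
        timestampsMillis.putIfAbsent(stage, nowMillis);
    }

    /** Returns the elapsed milliseconds between two stages, or -1 if either is missing. */
    long elapsedMillis(Stage from, Stage to) {
        Long start = timestampsMillis.get(from);
        Long end = timestampsMillis.get(to);
        return (start == null || end == null) ? -1L : end - start;
    }
}
```

With something like this per task and per checkpoint, the gap between
UPLOAD_STARTED and UPLOAD_COMPLETED would show directly whether the
remote file system is the slow part.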
Chen
On 4/24/20 10:52 PM, Congxian Qiu wrote:
Hi
If the bottleneck is the upload part, have you tried uploading the
files using multiple threads [1]?
[1] https://issues.apache.org/jira/browse/FLINK-11008
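
As a rough illustration of the idea only (this is not the code from
that ticket), independent file uploads can be submitted to a fixed-size
thread pool instead of being run one after another. The class
ParallelUploadSketch and its method uploadAll are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Hypothetical sketch, not Flink code: runs independent upload tasks in parallel. */
class ParallelUploadSketch {
    /** Runs every upload on a fixed-size pool and returns the results in input order. */
    static <T> List<T> uploadAll(List<Callable<T>> uploads, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<T> results = new ArrayList<>();
            // invokeAll blocks until all tasks are done and keeps the input order.
            for (Future<T> future : pool.invokeAll(uploads)) {
                // get() throws ExecutionException if the task itself failed.
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }
}
```

In Flink itself I believe the number of transfer threads is controlled
by state.backend.rocksdb.checkpoint.transfer.thread.num, but please
check the configuration docs for your version before relying on it.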
Best,
Congxian
On Fri, Apr 24, 2020 at 12:38 PM Lu Niu
<[email protected] <mailto:[email protected]>> wrote:
Hi, Robert
Thanks for replying. Yes, after I added monitoring on the above
path, it showed that the slowness did come from uploading files to
S3. I am still investigating the issue. At the same time, I am
trying PrestoS3FileSystem to check whether that can mitigate the
problem.
Best
Lu
On Thu, Apr 23, 2020 at 8:10 AM Robert Metzger
<[email protected] <mailto:[email protected]>> wrote:
Hi Lu,
were you able to resolve the issue with the slow async
checkpoints?
I've added Yu Li to this thread. He has more experience with
the state backends to decide which monitoring is appropriate
for such situations.
Best,
Robert
On Tue, Apr 21, 2020 at 10:50 PM Lu Niu <[email protected]
<mailto:[email protected]>> wrote:
Hi, Robert
Thanks for replying. To improve observability, do you
think we should expose more metrics in checkpointing? For
example, in incremental checkpoints, the time spent on
uploading sst files?
https://github.com/apache/flink/blob/5b71c7f2fe36c760924848295a8090898cb10f15/flink-state-backends/flink-statebackend-rocksdb/src/main/java/org/apache/flink/contrib/streaming/state/snapshot/RocksIncrementalSnapshotStrategy.java#L319
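
A minimal sketch of what I mean, assuming the upload can be wrapped in
a Callable. TimedUpload and the reportMillis callback are hypothetical
names; in Flink the callback would presumably feed a metric, but that
wiring is not shown here.

```java
import java.util.concurrent.Callable;
import java.util.function.LongConsumer;

/** Hypothetical sketch, not Flink code: measures how long a single upload takes. */
class TimedUpload {
    /** Runs the upload and reports its duration in milliseconds, even if it fails. */
    static <T> T timed(Callable<T> upload, LongConsumer reportMillis) throws Exception {
        long startNanos = System.nanoTime();
        try {
            return upload.call();
        } finally {
            reportMillis.accept((System.nanoTime() - startNanos) / 1_000_000L);
        }
    }
}
```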
Best
Lu
On Fri, Apr 17, 2020 at 11:31 AM Robert Metzger
<[email protected] <mailto:[email protected]>> wrote:
Hi,
did you check the TaskManager logs for retries by the
s3a file system during checkpointing?
I'm not aware of any metrics in Flink that could be
helpful in this situation.
Best,
Robert
On Tue, Apr 14, 2020 at 12:02 AM Lu Niu
<[email protected] <mailto:[email protected]>> wrote:
Hi, Flink users
We have noticed that async checkpointing can sometimes
be extremely slow, leading to checkpoint timeouts. For
example, for a state size of around 2.5 MB, the async
part can take 7 to 12 minutes:
[Attachment: Screen Shot 2020-04-09 at 5.04.30 PM.png]
Note that all the slowness comes from the async part
of the checkpoint; there is no delay in the sync part
or in barrier alignment. As we use RocksDB incremental
checkpointing, I suspect the slowness is caused by
uploading the files to S3. However, I am not
completely sure, since there are other steps in async
checkpointing. Does Flink expose fine-grained metrics
to debug such slowness?
Setup: Flink 1.9.1, RocksDB incremental state
backend, S3AHadoopFileSystem
Best
Lu