Re: Exceeded Checkpoint tolerable failure

Hangxiang Yu, Sun, 11 Dec 2022 22:15:58 -0800

Hi, Madan.
A checkpoint timeout of 5s is probably too small.
As you said, the timeouts look related to back pressure. You may also
find that the "start delay" metric is longer in 1.14 than in 1.9.
I'd suggest first increasing the checkpoint timeout and then comparing
the performance of the two versions.
If the configuration, data source, and everything else are the same but
1.9 still runs faster, I think the case deserves a closer look.
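
To illustrate why a small timeout combined with a tolerable failure
number of 4 can restart the job, here is a minimal, self-contained sketch.
It is not Flink's code: the class and method names are hypothetical, the
checkpoint durations are made-up numbers, and it assumes that failures are
counted consecutively, that a successful checkpoint resets the count, and
that the job fails once the count exceeds the tolerable number.

```java
public class CheckpointToleranceSketch {

    /** Simplified stand-in for a consecutive-failure counter (hypothetical, not a Flink class). */
    static final class FailureCounter {
        private final int tolerableFailures;
        private int consecutiveFailures = 0;

        FailureCounter(int tolerableFailures) {
            this.tolerableFailures = tolerableFailures;
        }

        /** Records one checkpoint result; returns true if the threshold is now exceeded. */
        boolean onCheckpointResult(boolean succeeded) {
            if (succeeded) {
                consecutiveFailures = 0; // assumption: a success resets the count
                return false;
            }
            consecutiveFailures++;
            return consecutiveFailures > tolerableFailures;
        }
    }

    /**
     * Returns the 1-based index of the checkpoint at which the threshold is
     * exceeded, or -1 if it never is. A checkpoint counts as failed when its
     * duration is longer than the timeout.
     */
    static int firstFatalCheckpoint(long timeoutMs, int tolerableFailures, long[] durationsMs) {
        FailureCounter counter = new FailureCounter(tolerableFailures);
        for (int i = 0; i < durationsMs.length; i++) {
            boolean succeeded = durationsMs[i] <= timeoutMs;
            if (counter.onCheckpointResult(succeeded)) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Hypothetical checkpoint durations under back pressure, in milliseconds.
        long[] durationsMs = {3000, 7000, 8000, 9000, 6500, 7500, 4000};

        // 5s timeout, 4 tolerable failures: the fifth consecutive timeout is fatal.
        System.out.println(firstFatalCheckpoint(5_000, 4, durationsMs));  // prints 6

        // 60s timeout, same durations: no checkpoint times out.
        System.out.println(firstFatalCheckpoint(60_000, 4, durationsMs)); // prints -1
    }
}
```

In the real job the timeout itself would be raised through the checkpoint
configuration; the sketch only shows how the two settings interact.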



On Fri, Dec 9, 2022 at 11:41 AM Madan D <madan_de...@yahoo.com.au> wrote:

> Hi Hangxiang,
>
> Thanks for your response.
>
> I see it's happening due to back pressure, but the same configuration
> worked before the upgrade (on 1.9.0).
> We are also setting tolerable checkpoint failures to 4 at the application
> level (execEnv.getCheckpointConfig().setTolerableCheckpointFailureNumber(4)).
>
> Attached images for reference.
>
>
>
> Regards,
> Madan
>
> On Thursday, 8 December 2022 at 06:29:49 pm GMT-8, Hangxiang Yu <
> master...@gmail.com> wrote:
>
>
> Hi, Madan.
> I think the exception should have a root cause; could you share it?
> BTW, if you haven't set a value for
> execution.checkpointing.tolerable-failed-checkpoints [1], I'd recommend
> setting it, which can avoid job restarts caused by recoverable temporary
> problems.
>
> [1]
> https://nightlies.apache.org/flink/flink-docs-release-1.16/docs/deployment/config/#execution-checkpointing-tolerable-failed-checkpoints
>
> On Thu, Dec 8, 2022 at 11:41 AM Madan D via user <user@flink.apache.org>
> wrote:
>
> Hello All,
> I am seeing the issue below after upgrading from 1.9.0 to 1.14.2, while
> publishing messages to Pub/Sub. It is causing frequent job restarts and
> slow processing.
>
> Can you please help me?
>
> Caused by: org.apache.flink.util.FlinkRuntimeException: Exceeded checkpoint tolerable failure threshold.
>    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleCheckpointException(CheckpointFailureManager.java:98)
>    at org.apache.flink.runtime.checkpoint.CheckpointFailureManager.handleJobLevelCheckpointException(CheckpointFailureManager.java:67)
>    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1940)
>    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.abortPendingCheckpoint(CheckpointCoordinator.java:1912)
>    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.access$600(CheckpointCoordinator.java:98)
>    at org.apache.flink.runtime.checkpoint.CheckpointCoordinator$CheckpointCanceller.run(CheckpointCoordinator.java:1996)
>    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>    at java.util.concurrent.ThreadPoolExecut
>
> Regards,
> Madan
>
>
>
>
> --
> Best,
> Hangxiang.
>


-- 
Best,
Hangxiang.
