Hi, As the follow up, it seem that savepoint can't be subsumed, so that its notification could still be send to each TMs. Is this a bug that need to be fixed in TwoPhaseCommitSinkFunction?
Best, Tony Wei Tony Wei <tony19920...@gmail.com> 於 2019年11月27日 週三 下午3:43寫道: > Hi, > > I want to raise this question again, since I have had this exception on my > production job. > > The exception is as follows > > >> 2019-11-27 14:47:29 > > > > java.lang.RuntimeException: Error while confirming checkpoint >> at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1205) >> at java.util.concurrent.Executors$RunnableAdapter.call(Executors >> .java:511) >> at java.util.concurrent.FutureTask.run(FutureTask.java:266) >> at java.util.concurrent.ThreadPoolExecutor.runWorker( >> ThreadPoolExecutor.java:1149) >> at java.util.concurrent.ThreadPoolExecutor$Worker.run( >> ThreadPoolExecutor.java:624) >> at java.lang.Thread.run(Thread.java:748) >> Caused by: java.lang.IllegalStateException: checkpoint completed, but no >> transaction pending >> at org.apache.flink.util.Preconditions.checkState(Preconditions.java: >> 195) >> at org.apache.flink.streaming.api.functions.sink. >> TwoPhaseCommitSinkFunction.notifyCheckpointComplete( >> TwoPhaseCommitSinkFunction.java:267) >> at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator >> .notifyCheckpointComplete(AbstractUdfStreamOperator.java:130) >> at org.apache.flink.streaming.runtime.tasks.StreamTask >> .notifyCheckpointComplete(StreamTask.java:822) >> at org.apache.flink.runtime.taskmanager.Task$2.run(Task.java:1200) >> ... 5 more > > > And these are the checkpoint / savepoint before the job failed. > [image: checkoint.png] > > It seems that checkpoint # 675's notification handled the savepoint # > 674's pending transaction holder, but savepoint #674's notification didn't > be subsumed or be ignored by JM. > Therefore, during the checkpoint #676, some tasks got notification before > getting the checkpoint barrier and led to this exception happened, because > there was no pending transaction in queue. > > Does anyone know the details about subsumed notifications mechanism and > how checkpoint coordinator handle this situation? Please correct me if I'm > wrong. Thanks. > > Best, > Tony Wei > > Stefan Richter <s.rich...@data-artisans.com> 於 2018年10月8日 週一 下午5:03寫道: > >> Hi Pedro, >> >> unfortunately the interesting parts are all removed from the log, we >> already know about the exception itself. In particular, what I would like >> to see is what checkpoints have been triggered and completed before the >> exception happens. >> >> Best, >> Stefan >> >> > Am 08.10.2018 um 10:23 schrieb PedroMrChaves <pedro.mr.cha...@gmail.com >> >: >> > >> > Hello, >> > >> > Find attached the jobmanager.log. I've omitted the log lines from other >> > runs, only left the job manager info and the run with the error. >> > >> > jobmanager.log >> > < >> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/file/t612/jobmanager.log> >> >> > >> > >> > >> > Thanks again for your help. >> > >> > Regards, >> > Pedro. >> > >> > >> > >> > ----- >> > Best Regards, >> > Pedro Chaves >> > -- >> > Sent from: >> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/ >> >>