No worries, thanks for the update! It's good to hear that it worked for you.
Best regards,
Piotrek

Tue, 13 Oct 2020 at 22:43 Binh Nguyen Van <binhn...@gmail.com> wrote:

> Hi,
>
> Sorry for the late reply. It took me quite a while to change the JDK
> version to reproduce the issue. I confirmed that if I upgrade to a newer
> JDK version (I tried with JDK 1.8.0_265) the issue doesn’t happen.
>
> Thank you for helping
> -Binh
>
> On Fri, Oct 9, 2020 at 11:36 AM Piotr Nowojski <pnowoj...@apache.org>
> wrote:
>
>> Hi Binh,
>>
>> Could you try upgrading Flink's Java runtime? It was previously reported
>> that upgrading to jdk1.8.0_251 solved the problem.
>>
>> Piotrek
>>
>> Fri, 9 Oct 2020 at 19:41 Binh Nguyen Van <binhn...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Thank you for helping me!
>>> The code is compiled on
>>>
>>> java version "1.8.0_161"
>>> Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)
>>>
>>> But I just checked our Hadoop and its Java version is
>>>
>>> java version "1.8.0_77"
>>> Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
>>> Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
>>>
>>> Thanks
>>> -Binh
>>>
>>> On Fri, Oct 9, 2020 at 10:23 AM Piotr Nowojski <pnowoj...@apache.org>
>>> wrote:
>>>
>>>> Hi,
>>>>
>>>> One more thing. It looks like it's not a Flink issue, but a JDK bug.
>>>> Others reported that upgrading the JDK version (for example to
>>>> jdk1.8.0_251) seemed to solve this problem. What JDK version are you
>>>> using?
>>>>
>>>> Piotrek
>>>>
>>>> Fri, 9 Oct 2020 at 17:59 Piotr Nowojski <pnowoj...@apache.org> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for reporting the problem. I think this is a known issue [1]
>>>>> that we are working to fix.
>>>>>
>>>>> Piotrek
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-18196
>>>>>
>>>>> Mon, 5 Oct 2020 at 08:54 Binh Nguyen Van <binhn...@gmail.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I have a streaming job that is written in Apache Beam and uses Flink
>>>>>> as its runner. The job worked as expected for about 15 hours and then
>>>>>> started to have checkpointing errors. The error message looks like this:
>>>>>>
>>>>>> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>>>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>>>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>>>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>>>>>     at java.lang.Thread.run(Thread.java:745)
>>>>>> Caused by: java.lang.NullPointerException
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>>>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>>>>>     ... 11 more
>>>>>>
>>>>>> When this happens, I have to stop the job and start it again, and
>>>>>> then 15 hours later the issue happens again.
>>>>>>
>>>>>> Here is some additional information:
>>>>>>
>>>>>> - Flink version is 1.10.1
>>>>>> - The job reads data from Kafka, transforms it, and then writes to
>>>>>>   Kafka
>>>>>> - There are 6 tasks with a parallelism of 60 each (each task reads
>>>>>>   from 1 Kafka topic)
>>>>>> - The job is deployed to run on YARN with 60 task managers and each
>>>>>>   task manager has 1 slot
>>>>>> - The state backend is filesystem and HDFS is the storage (this
>>>>>>   doesn’t seem to be related to the type of state backend, since the
>>>>>>   issue also happened when I used memory as the state backend)
>>>>>> - The checkpointing interval is 60 seconds (the longest duration of
>>>>>>   a normal checkpoint as shown in the Flink UI is 14 seconds)
>>>>>> - The minimum pause between checkpoints is 30 seconds
>>>>>> - The Hadoop cluster is Kerberized but Kafka is not. Keytab and
>>>>>>   principal are set in the Flink configuration file
>>>>>>
>>>>>> Can someone please help?
>>>>>>
>>>>>> Thanks
>>>>>> -Binh
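
The mismatch described in this thread (code compiled on 1.8.0_161, Hadoop cluster running 1.8.0_77) could be caught early by checking the runtime JVM at startup. The following is only a minimal sketch and is not part of Flink or Beam: the class `JvmUpdateCheck`, its helper methods, and the threshold of 251 are hypothetical, with the threshold taken solely from the update that was reported in the thread to help. It reads the standard `java.version` system property of the JVM it runs in, so it would have to run on the cluster's JVM, not on the build machine.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a Flink or Beam API.
class JvmUpdateCheck {
    // Matches legacy Java 8 version strings such as "1.8.0", "1.8.0_77" or "1.8.0_265-b01".
    private static final Pattern JAVA8 = Pattern.compile("^1\\.8\\.0(?:_(\\d+))?.*$");

    /** Returns the Java 8 update number (0 if none is given), or -1 for non-Java-8 strings. */
    static int updateNumber(String version) {
        Matcher m = JAVA8.matcher(version);
        if (!m.matches()) {
            return -1;
        }
        String update = m.group(1);
        return update == null ? 0 : Integer.parseInt(update);
    }

    /** True only for Java 8 version strings whose update number is below minUpdate. */
    static boolean isOlderJava8Update(String version, int minUpdate) {
        int update = updateNumber(version);
        return update >= 0 && update < minUpdate;
    }

    public static void main(String[] args) {
        String version = System.getProperty("java.version");
        // Assumption: 251 is the update mentioned in the thread, not a verified fix version.
        int minUpdate = 251;
        if (isOlderJava8Update(version, minUpdate)) {
            System.err.println("WARN: running on Java " + version
                    + ", older than 1.8.0_" + minUpdate);
        } else {
            System.out.println("Java runtime version: " + version);
        }
    }
}
```

Running this with the same `java` binary that the YARN containers use reports that runtime's version. Version strings from Java 9 and later, such as `11.0.8`, are not flagged by this check.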