Re: NPE when checkpointing

Binh Nguyen Van Fri, 09 Oct 2020 10:41:31 -0700

Hi,

Thank you for helping me!
The code was compiled with:

java version "1.8.0_161"
Java(TM) SE Runtime Environment (build 1.8.0_161-b12)
Java HotSpot(TM) 64-Bit Server VM (build 25.161-b12, mixed mode)

But I just checked our Hadoop cluster, and its Java version is:

java version "1.8.0_77"
Java(TM) SE Runtime Environment (build 1.8.0_77-b03)
Java HotSpot(TM) 64-Bit Server VM (build 25.77-b03, mixed mode)
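
In case it helps, below is a minimal sketch of how the JVM version could be
checked from inside a running process, rather than relying on `java -version`
on whichever host happens to be at hand. The class name `JvmVersionCheck` and
the helper `updateNumber` are made up for illustration; the only things
assumed to exist are the standard `java.version` and `java.vm.version` system
properties. The parsing only targets the legacy "1.8.0_NN" format shown above.

```java
public class JvmVersionCheck {

    /**
     * Extracts the update number from a legacy version string such as
     * "1.8.0_77". Returns -1 when the string has no "_NN" part
     * (for example "11.0.8").
     */
    public static int updateNumber(String version) {
        int idx = version.indexOf('_');
        if (idx < 0) {
            return -1;
        }
        String tail = version.substring(idx + 1);
        // Keep only the leading digits so suffixes like "-b03" are ignored.
        int end = 0;
        while (end < tail.length() && Character.isDigit(tail.charAt(end))) {
            end++;
        }
        if (end == 0) {
            return -1;
        }
        return Integer.parseInt(tail.substring(0, end));
    }

    public static void main(String[] args) {
        // Both properties are set by the JVM that is actually running this code.
        String version = System.getProperty("java.version");
        String vmVersion = System.getProperty("java.vm.version");
        System.out.println("java.version=" + version
                + ", java.vm.version=" + vmVersion
                + ", update=" + updateNumber(version));
    }
}
```

Logging the same two properties once at job startup would show which JVM the
task managers themselves run on, which may differ from the compile-time JDK.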

Thanks
-Binh

On Fri, Oct 9, 2020 at 10:23 AM Piotr Nowojski <[email protected]> wrote:

> Hi,
>
> One more thing: it looks like this is not a Flink issue but a JDK bug.
> Others have reported that upgrading the JDK (for example, to jdk1.8.0_251)
> seemed to solve the problem. What JDK version are you using?
>
> Piotrek
>
> On Fri, Oct 9, 2020 at 5:59 PM Piotr Nowojski <[email protected]> wrote:
>
>> Hi,
>>
>> Thanks for reporting the problem. I think this is a known issue [1] that
>> we are working to fix.
>>
>> Piotrek
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-18196
>>
>> On Mon, Oct 5, 2020 at 8:54 AM Binh Nguyen Van <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I have a streaming job that is written in Apache Beam and uses Flink as
>>> its runner. The job works as expected for about 15 hours and then
>>> starts to hit checkpointing errors. The error message looks like this:
>>>
>>> java.lang.Exception: Could not perform checkpoint 910 for operator Source: <source-name> (8/60).
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>>>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>>>     at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>>>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>>>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>>>     at java.lang.Thread.run(Thread.java:745)
>>> Caused by: java.lang.NullPointerException
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>>>     at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>>>     ... 11 more
>>>
>>> When this happens, I have to stop the job and start it again, and
>>> about 15 hours later the issue happens again.
>>>
>>> Here is some additional information:
>>>
>>>    - Flink version is 1.10.1
>>>    - The job reads data from Kafka, transforms it, and then writes to Kafka
>>>    - There are 6 tasks with the parallelism of 60 each (each task reads
>>>    from 1 Kafka topic)
>>>    - The job is deployed to run on YARN with 60 task managers and each
>>>    task manager has 1 slot
>>>    - The state backend is filesystem and HDFS is the storage (this doesn't
>>>    seem to be related to the type of state backend, since the issue also
>>>    happened when I used memory as the state backend)
>>>    - The checkpointing interval is 60 seconds (the longest duration of
>>>    a normal checkpoint, as shown in the Flink UI, is 14 seconds)
>>>    - The minimum pause between checkpoints is 30 seconds
>>>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and
>>>    principal are set in the Flink configuration file
>>>
>>> Can someone please help?
>>>
>>> Thanks
>>> -Binh
>>>
>>
