Hi,

Thanks for reporting the problem. I think this is a known issue [1] on
which we are working to fix.

Piotrek

[1] https://issues.apache.org/jira/browse/FLINK-18196

pon., 5 paź 2020 o 08:54 Binh Nguyen Van <binhn...@gmail.com> napisał(a):

> Hi,
>
> I have a streaming job that is written in Apache Beam and uses Flink as
> its runner. The job is working as expected for about 15 hours and then it
> started to have checkpointing error. The error message looks like this
>
> java.lang.Exception: Could not perform checkpoint 910 for operator Source: 
> <source-name> (8/60).
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:785)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$triggerCheckpointAsync$3(StreamTask.java:760)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.run(StreamTaskActionExecutor.java:87)
>     at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:78)
>     at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:261)
>     at 
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:186)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:485)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:469)
>     at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:708)
>     at org.apache.flink.runtime.taskmanager.Task.run(Task.java:533)
>     at java.lang.Thread.run(Thread.java:745)
> Caused by: java.lang.NullPointerException
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:1394)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:974)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$5(StreamTask.java:870)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$SynchronizedStreamTaskActionExecutor.runThrowing(StreamTaskActionExecutor.java:94)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:843)
>     at 
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:776)
>     ... 11 more
>
> When this happened, I have to stop the job and then start it again, and
> then 15 hours later the issue happens again.
>
> Here are some additional information
>
>    - Flink version is 1.10.1
>    - Job reads data from Kafka, transform, and then writes to Kafka
>    - There are 6 tasks with the parallelism of 60 each (each task reads
>    from 1 Kafka topic)
>    - The job is deployed to run on YARN with 60 task managers and each
>    task manager has 1 slot
>    - The State backend is filesystem and HDFS is the storage (Doesn’t
>    seem to related to the type of state backend since the issue also happened
>    when I use memory as the state backend)
>    - The checkpointing interval is 60 seconds (The longest duration of
>    the normal checkpoint as shown in Flink UI is 14 seconds)
>    - The minimum pause between checkpoints is 30 seconds
>    - Hadoop cluster is Kerberized but Kafka is not. Keytab and principal
>    are set in the Flink configuration file
>
> Can someone please help?
>
> Thanks
> -Binh
>

Reply via email to