Thanks for the reply. Well, tracing back to the root cause, I see the
following:

1. On the JobManager, checkpoint times are getting progressively worse:

2017-09-16 05:05:50,813 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 1 @ 1505538350809
2017-09-16 05:05:51,396 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 1 (11101233 bytes in 586 ms).
2017-09-16 05:07:30,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 2 @ 1505538450809
2017-09-16 05:07:31,657 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 2 (18070955 bytes in 583 ms).

...
2017-09-16 07:32:58,117 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 89 (246125113 bytes in 27194 ms).
2017-09-16 07:34:10,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 90 @ 1505547250809
2017-09-16 07:34:44,932 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 90 (248272325 bytes in 34012 ms).
2017-09-16 07:35:50,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 91 @ 1505547350809
2017-09-16 07:36:37,058 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 91 (250348812 bytes in 46136 ms).
2017-09-16 07:37:30,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 92 @ 1505547450809
2017-09-16 07:38:18,076 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 92 (252399724 bytes in 47152 ms).
2017-09-16 07:39:10,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 93 @ 1505547550809
2017-09-16 07:40:13,494 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 93 (254374636 bytes in 62573 ms).
2017-09-16 07:40:50,809 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 94 @ 1505547650809
2017-09-16 07:42:42,850 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 94 (256386533 bytes in 111898 ms).
2017-09-16 07:42:42,850 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 95 @ 1505547762850
2017-09-16 07:46:06,241 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed checkpoint 95 (258441766 bytes in 203268 ms).
2017-09-16 07:46:06,241 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Triggering checkpoint 96 @ 1505547966241
2017-09-16 07:48:42,069 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph        - KeyedCEPPatternOperator -> Map (1/4) (ff835faa9eb9182ed2f2230a1e5cc56d) switched from RUNNING to FAILED.
AsynchronousException{java.lang.Exception: Could not materialize checkpoint 96 for operator KeyedCEPPatternOperator -> Map (1/4).}
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:970)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.Exception: Could not materialize checkpoint 96 for operator KeyedCEPPatternOperator -> Map (1/4).
    ... 6 more
Caused by: java.util.concurrent.ExecutionException: java.lang.OutOfMemoryError: GC overhead limit exceeded
    at java.util.concurrent.FutureTask.report(FutureTask.java:122)
    at java.util.concurrent.FutureTask.get(FutureTask.java:192)
    at org.apache.flink.util.FutureUtil.runIfNotDoneAndGet(FutureUtil.java:43)
    at org.apache.flink.streaming.runtime.tasks.StreamTask$AsyncCheckpointRunnable.run(StreamTask.java:897)
    ... 5 more


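So, it looks like the KeyedCEPPatternOperator task ran out of memory while
materializing checkpoint 96 (the failure is reported in the JobManager log),
apparently because of the progressively worsening checkpoints. Any ideas on
how to make the checkpoints faster?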
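
To put numbers on the slowdown, here is a minimal sketch that pulls the size
and duration out of the "Completed checkpoint" lines. It is not part of Flink:
the function name `parse_completed` and the regular expression are my own
assumptions, based only on the log format pasted above, and the sample text is
two entries copied from that log.

```python
import re

# Matches e.g. "Completed checkpoint 90 (248272325 bytes in 34012 ms)."
# The pattern is an assumption derived from the log lines in this mail.
COMPLETED = re.compile(r"Completed checkpoint (\d+) \((\d+) bytes in (\d+) ms\)")


def parse_completed(log_text):
    """Return (checkpoint_id, size_bytes, duration_ms) tuples from log text."""
    # Collapse all whitespace so entries wrapped across lines still match.
    flat = " ".join(log_text.split())
    return [tuple(int(g) for g in m.groups()) for m in COMPLETED.finditer(flat)]


# Two entries copied from the JobManager log above (first and last completed).
sample = """
2017-09-16 05:05:51,396 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed
checkpoint 1 (11101233 bytes in 586 ms).
2017-09-16 07:46:06,241 INFO
org.apache.flink.runtime.checkpoint.CheckpointCoordinator     - Completed
checkpoint 95 (258441766 bytes in 203268 ms).
"""

for cp_id, size_bytes, duration_ms in parse_completed(sample):
    size_mb = size_bytes / 1e6
    seconds = duration_ms / 1000
    # Throughput shows the slowdown independently of the growing state size.
    print(f"checkpoint {cp_id}: {size_mb:.1f} MB in {seconds:.1f} s "
          f"({size_mb / seconds:.2f} MB/s)")
```

Run against the full log, this should make it easy to see that the checkpoint
duration grows much faster than the checkpoint size does.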






On Thu, Sep 21, 2017 at 7:29 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org>
wrote:

> Hi Sridhar,
>
> Sorry that this didn't get a response earlier.
>
> According to the trace, it seems like the job failed during the process, and
> when trying to automatically restore from a checkpoint, deserialization of a
> CEP `IterativeCondition` object failed. As far as I can tell, CEP operators
> are just using Java serialization on CEP `IterativeCondition` objects, so
> this should not be related to the protobuf serializer that you are using.
>
> Is this still constantly happening for you?
>
> Cheers,
> Gordon
>
