Hi, I am using flink 1.4.2 with rocksdb as backend. I am using process function with timer on EventTime. For checkpointing I am using hdfs.
I am trying load testing so Iam reading kafka from beginning (aprox 7 days data with 50M events). My job gets stuck after aprox 20 min with no error. There after watermark do not progress and all checkpoint fails. Also When I try to cancel my job (using web UI) , it takes several minutes to finally gets cancelled. Also it makes Task manager down as well. There is no logs while my job hanged but while cancelling I get following error. / 2018-07-11 09:10:39,385 ERROR org.apache.flink.runtime.taskmanager.TaskManager - ============================================================== ====================== FATAL ======================= ============================================================== A fatal error occurred, forcing the TaskManager to shut down: Task 'process (3/6)' did not react to cancelling signal in the last 30 seconds, but is stuck in method: org.rocksdb.RocksDB.get(Native Method) org.rocksdb.RocksDB.get(RocksDB.java:810) org.apache.flink.contrib.streaming.state.RocksDBMapState.get(RocksDBMapState.java:102) org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47) nl.ing.gmtt.observer.analyser.job.rtpe.process.RtpeProcessFunction.onTimer(RtpeProcessFunction.java:94) org.apache.flink.streaming.api.operators.KeyedProcessOperator.onEventTime(KeyedProcessOperator.java:75) org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:288) org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:108) org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:885) org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189) org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) java.lang.Thread.run(Thread.java:748) 2018-07-11 09:10:39,390 DEBUG org.apache.flink.runtime.akka.StoppingSupervisorWithoutLoggingActorKilledExceptionStrategy - Actor was killed. Stopping it now. akka.actor.ActorKilledException: Kill 2018-07-11 09:10:39,407 INFO org.apache.flink.runtime.taskmanager.TaskManager - Stopping TaskManager akka://flink/user/taskmanager#-1231617791. 2018-07-11 09:10:39,408 INFO org.apache.flink.runtime.taskmanager.TaskManager - Cancelling all computations and discarding all cached data. 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally process (3/6) (432fd129f3eea363334521f8c8de5198). 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.Task - Task process (3/6) is already in state CANCELING 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally process (4/6) (7c6b96c9f32b067bdf8fa7c283eca2e0). 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.Task - Task process (4/6) is already in state CANCELING 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally process (2/6) (a4f731797a7ea210fd0b512b0263bcd9). 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.Task - Task process (2/6) is already in state CANCELING 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.Task - Attempting to fail task externally process (1/6) (cd8a113779a4c00a051d78ad63bc7963). 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.Task - Task process (1/6) is already in state CANCELING 2018-07-11 09:10:39,409 INFO org.apache.flink.runtime.taskmanager.TaskManager - Disassociating from JobManager 2018-07-11 09:10:39,412 INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache 2018-07-11 09:10:39,431 INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache 2018-07-11 09:10:39,444 INFO org.apache.flink.runtime.leaderretrieval.ZooKeeperLeaderRetrievalService - Stopping ZooKeeperLeaderRetrievalService. 2018-07-11 09:10:39,444 DEBUG org.apache.flink.runtime.io.disk.iomanager.IOManager - Shutting down I/O manager. 2018-07-11 09:10:39,451 INFO org.apache.flink.runtime.io.disk.iomanager.IOManager - I/O manager removed spill file directory /tmp/flink-io-989505e5-ac33-4d56-add5-04b2ad3067b4 2018-07-11 09:10:39,461 INFO org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down the network environment and its components. 2018-07-11 09:10:39,461 DEBUG org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down network connection manager 2018-07-11 09:10:39,462 INFO org.apache.flink.runtime.io.network.netty.NettyClient - Successful shutdown (took 1 ms). 2018-07-11 09:10:39,472 INFO org.apache.flink.runtime.io.network.netty.NettyServer - Successful shutdown (took 10 ms). 2018-07-11 09:10:39,472 DEBUG org.apache.flink.runtime.io.network.NetworkEnvironment - Shutting down intermediate result partition manager 2018-07-11 09:10:39,473 DEBUG org.apache.flink.runtime.io.network.partition.ResultPartitionManager - Releasing 0 partitions because of shutdown. 2018-07-11 09:10:39,474 DEBUG org.apache.flink.runtime.io.network.partition.ResultPartitionManager - Successful shutdown. 2018-07-11 09:10:39,498 INFO org.apache.flink.runtime.taskmanager.TaskManager - Task manager akka://flink/user/taskmanager is completely shut down. 2018-07-11 09:10:39,504 ERROR org.apache.flink.runtime.taskmanager.TaskManager - Actor akka://flink/user/taskmanager#-1231617791 terminated, stopping process... 2018-07-11 09:10:39,563 INFO org.apache.flink.runtime.taskmanager.Task - Notifying TaskManager about fatal error. Task 'process (2/6)' did not react to cancelling signal in the last 30 seconds, but is stuck in method: org.rocksdb.RocksDB.get(Native Method) org.rocksdb.RocksDB.get(RocksDB.java:810) org.apache.flink.contrib.streaming.state.RocksDBMapState.get(RocksDBMapState.java:102) org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47) nl.ing.gmtt.observer.analyser.job.rtpe.process.RtpeProcessFunction.onTimer(RtpeProcessFunction.java:94) org.apache.flink.streaming.api.operators.KeyedProcessOperator.onEventTime(KeyedProcessOperator.java:75) org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:288) org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:108) org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:885) org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189) org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) java.lang.Thread.run(Thread.java:748) . 2018-07-11 09:10:39,575 INFO org.apache.flink.runtime.taskmanager.Task - Notifying TaskManager about fatal error. Task 'process (1/6)' did not react to cancelling signal in the last 30 seconds, but is stuck in method: sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) java.lang.reflect.Constructor.newInstance(Constructor.java:423) java.lang.Class.newInstance(Class.java:442) org.apache.flink.api.java.typeutils.runtime.PojoSerializer.createInstance(PojoSerializer.java:196) org.apache.flink.api.java.typeutils.runtime.PojoSerializer.deserialize(PojoSerializer.java:399) org.apache.flink.contrib.streaming.state.RocksDBMapState.deserializeUserValue(RocksDBMapState.java:304) org.apache.flink.contrib.streaming.state.RocksDBMapState.get(RocksDBMapState.java:104) org.apache.flink.runtime.state.UserFacingMapState.get(UserFacingMapState.java:47) nl.ing.gmtt.observer.analyser.job.rtpe.process.RtpeProcessFunction.onTimer(RtpeProcessFunction.java:94) org.apache.flink.streaming.api.operators.KeyedProcessOperator.onEventTime(KeyedProcessOperator.java:75) org.apache.flink.streaming.api.operators.HeapInternalTimerService.advanceWatermark(HeapInternalTimerService.java:288) org.apache.flink.streaming.api.operators.InternalTimeServiceManager.advanceWatermark(InternalTimeServiceManager.java:108) org.apache.flink.streaming.api.operators.AbstractStreamOperator.processWatermark(AbstractStreamOperator.java:885) org.apache.flink.streaming.runtime.io.StreamInputProcessor$ForwardingValveOutputHandler.handleWatermark(StreamInputProcessor.java:288) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.findAndOutputNewMinWatermarkAcrossAlignedChannels(StatusWatermarkValve.java:189) org.apache.flink.streaming.runtime.streamstatus.StatusWatermarkValve.inputWatermark(StatusWatermarkValve.java:111) org.apache.flink.streaming.runtime.io.StreamInputProcessor.processInput(StreamInputProcessor.java:189) org.apache.flink.streaming.runtime.tasks.OneInputStreamTask.run(OneInputStreamTask.java:69) org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:264) org.apache.flink.runtime.taskmanager.Task.run(Task.java:718) java.lang.Thread.run(Thread.java:748) / -- Sent from: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/