Hi: I am using Flink streaming continuous query. Scenario: Kafka-connector to consume a topic, and streaming incremental calculate 24 hours window data. And use processingTime as TimeCharacteristic. I am using RocksDB as StateBackend, file system is HDFS, and checkpoint interval is 5 minutes.
env.setStreamTimeCharacteristic(TimeCharacteristic.ProcessingTime); RocksDBStateBackend rocksdb = new RocksDBStateBackend(checkPointPath, true); rocksdb.setPredefinedOptions(PredefinedOptions.FLASH_SSD_OPTIMIZED); env.setStateBackend(rocksdb); env.enableCheckpointing(checkPointInterval); env.getCheckpointConfig().setMinPauseBetweenCheckpoints(checkPointInterval); After I run the application for serval hours, the info log shows 2018-06-04 19:29:08,048 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 9 @ 1528108147985 for job 33313f186b439312bd09e5672e8af661. but not completed log, and the checkpoint failed Via web UI metrics, kafka commit offset stop increase, and kafka current offset still go ahead, wait for 2 hours, kafka stop consume message. Then I enable debug log, and try to reproduce the issue, During normal stage, the log shows there are DFSClient send data package 2018-06-04 19:23:58,933 DEBUG org.apache.hadoop.hdfs.DFSClient - /flink/cps/33313f186b439312bd09e5672e8af661/chk-8: masked=rwxr-xr-x 2018-06-04 19:23:58,934 DEBUG org.apache.hadoop.ipc.Client - The ping interval is 60000 ms. 2018-06-04 19:23:58,934 DEBUG org.apache.hadoop.ipc.Client - Connecting to fds-hadoop-prod30-mp/10.10.22.50:8020 2018-06-04 19:23:58,935 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: starting, having connections 1 2018-06-04 19:23:58,936 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1709 2018-06-04 19:23:58,967 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1709 2018-06-04 19:23:58,967 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: mkdirs took 33ms 2018-06-04 19:23:58,967 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 8 @ 1528107838933 for job 33313f186b439312bd09e5672e8af661. 2018-06-04 19:24:00,054 DEBUG org.apache.hadoop.hdfs.DFSClient - /flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata: masked=rw-r--r-- 2018-06-04 19:24:00,055 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1710 2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1710 2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: create took 6ms 2018-06-04 19:24:00,060 DEBUG org.apache.hadoop.hdfs.DFSClient - computePacketChunkSize: src=/flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata, chunkSize=516, chunksPerPacket=126, packetSize=65016 2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.LeaseRenewer - Lease renewer daemon for [DFSClient_NONMAPREDUCE_-866487647_111] with renew id 1 started 2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient - DFSClient writeChunk allocating new packet seqno=0, src=/flink/cps/33313f186b439312bd09e5672e8af661/chk-8/_metadata, packetSize=65016, chunksPerPacket=126, bytesCurBlock=0 2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient - DFSClient flush(): bytesCurBlock=6567 lastFlushOffset=0 createNewBlock=false 2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient - Queued packet 0 2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient - Waiting for ack for: 0 2018-06-04 19:24:00,061 DEBUG org.apache.hadoop.hdfs.DFSClient - Allocating new block 2018-06-04 19:24:00,062 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1711 2018-06-04 19:24:00,068 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1711 2018-06-04 19:24:00,068 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: addBlock took 6ms 2018-06-04 19:24:00,069 DEBUG org.apache.hadoop.hdfs.DFSClient - pipeline = DatanodeInfoWithStorage[10.10.22.51:50010,DS-a67bc965-fb34-4d18-bfb6-4771a485ae84,DISK] 2018-06-04 19:24:00,069 DEBUG org.apache.hadoop.hdfs.DFSClient - pipeline = DatanodeInfoWithStorage[10.10.22.52:50010,DS-37e4ccb5-9ea1-4e11-90c4-7052aec50282,DISK] 2018-06-04 19:24:00,069 DEBUG org.apache.hadoop.hdfs.DFSClient - pipeline = DatanodeInfoWithStorage[10.10.9.21:50010,DS-129977b5-a5c7-4e2f-827a-e4ef9fa22647,DISK] 2018-06-04 19:24:00,069 DEBUG org.apache.hadoop.hdfs.DFSClient - Connecting to datanode 10.10.22.51:50010 2018-06-04 19:24:00,069 DEBUG org.apache.hadoop.hdfs.DFSClient - Send buf size 124928 And after send data package, there are some IPC operations: 2018-06-04 19:24:00,507 DEBUG org.apache.hadoop.hdfs.DFSClient - DFSClient seqno: 866 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 223700 2018-06-04 19:24:00,507 DEBUG org.apache.hadoop.hdfs.DFSClient - DataStreamer block BP-740685118-10.10.22.50-1493792592687:blk_1160534861_86796613 sending packet packet seqno: 867 offsetInBlock: 119274 lastPacketInBlock: true lastByteOffsetInBlock: 119274 2018-06-04 19:24:00,511 DEBUG org.apache.hadoop.hdfs.DFSClient - DFSClient seqno: 867 reply: SUCCESS reply: SUCCESS reply: SUCCESS downstreamAckTimeNanos: 2367379 2018-06-04 19:24:00,511 DEBUG org.apache.hadoop.hdfs.DFSClient - Closing old block BP-740685118-10.10.22.50-1493792592687:blk_1160534861_86796613 2018-06-04 19:24:00,511 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1713 2018-06-04 19:24:00,547 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1713 2018-06-04 19:24:00,547 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: complete took 36ms 2018-06-04 19:24:00,548 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1714 2018-06-04 19:24:00,571 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1714 2018-06-04 19:24:00,571 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: delete took 23ms 2018-06-04 19:24:00,571 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1715 2018-06-04 19:24:00,611 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1715 2018-06-04 19:24:00,612 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: delete took 41ms 2018-06-04 19:24:00,612 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1716 2018-06-04 19:24:00,635 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1716 2018-06-04 19:24:00,635 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: delete took 23ms The abnormal stage shows there are not any data to send, and following checkpoint interval became to 10 minutes, and not data to send to HDFS 2018-06-04 19:24:07,985 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Completed checkpoint 8 for job 33313f186b439312bd09e5672e8af661 (7080351 bytes in 1615 ms). 2018-06-04 19:24:17,919 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: closed 2018-06-04 19:24:17,920 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: stopped, remaining connections 0 2018-06-04 19:24:30,064 DEBUG org.apache.hadoop.hdfs.LeaseRenewer - Lease renewer daemon for [] with renew id 1 executed 2018-06-04 19:25:00,068 DEBUG org.apache.hadoop.hdfs.LeaseRenewer - Lease renewer daemon for [] with renew id 1 executed 2018-06-04 19:25:01,068 DEBUG org.apache.hadoop.hdfs.LeaseRenewer - Lease renewer daemon for [] with renew id 1 expired 2018-06-04 19:25:01,068 DEBUG org.apache.hadoop.hdfs.LeaseRenewer - Lease renewer daemon for [] with renew id 1 exited 2018-06-04 19:29:07,986 DEBUG org.apache.hadoop.hdfs.DFSClient - /flink/cps/33313f186b439312bd09e5672e8af661/chk-9: masked=rwxr-xr-x 2018-06-04 19:29:07,986 DEBUG org.apache.hadoop.ipc.Client - The ping interval is 60000 ms. 2018-06-04 19:29:07,986 DEBUG org.apache.hadoop.ipc.Client - Connecting to fds-hadoop-prod30-mp/10.10.22.50:8020 2018-06-04 19:29:07,987 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: starting, having connections 1 2018-06-04 19:29:07,988 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1990 2018-06-04 19:29:08,048 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin got value #1990 2018-06-04 19:29:08,048 DEBUG org.apache.hadoop.ipc.ProtobufRpcEngine - Call: mkdirs took 62ms 2018-06-04 19:29:08,048 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Triggering checkpoint 9 @ 1528108147985 for job 33313f186b439312bd09e5672e8af661. 2018-06-04 19:29:17,988 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: closed 2018-06-04 19:29:17,988 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: stopped, remaining connections 0 2018-06-04 19:39:08,049 INFO org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint 9 of job 33313f186b439312bd09e5672e8af661 expired before completing. 2018-06-04 19:39:08,050 DEBUG org.apache.hadoop.hdfs.DFSClient - /flink/cps/33313f186b439312bd09e5672e8af661/chk-10: masked=rwxr-xr-x 2018-06-04 19:39:08,050 DEBUG org.apache.hadoop.ipc.Client - The ping interval is 60000 ms. 2018-06-04 19:39:08,051 DEBUG org.apache.hadoop.ipc.Client - Connecting to fds-hadoop-prod30-mp/10.10.22.50:8020 2018-06-04 19:39:08,053 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin: starting, having connections 1 2018-06-04 19:39:08,054 DEBUG org.apache.hadoop.ipc.Client - IPC Client (2045458324) connection to fds-hadoop-prod30-mp/10.10.22.50:8020 from fdsadmin sending #1991 Then I disable checkpoint, currently the application run for more than 10 hours, kafka consume message as normal as expected. My link version is 1.5.0. Because of there are not any exception/error information in log, could anyone give me some suggestion. Thanks James