Hi
Which Flink version are you using, and what are your job's checkpoint/state-related settings? If possible, please also send the complete JM log.
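If it helps, the checkpoint/state settings can be pulled out of flink-conf.yaml with a small script like the one below. This is only a sketch: `state.` and `execution.checkpointing.` are the usual Flink option prefixes, while the function name and the sample configuration text are made up for illustration and are not taken from your job.

```python
def checkpoint_state_settings(conf_text):
    """Return the checkpoint/state-related entries of a flink-conf.yaml text."""
    # Prefixes of the Flink options that matter for checkpointing and state.
    prefixes = ("state.", "execution.checkpointing.")
    settings = {}
    for raw in conf_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines.
        if not line or line.startswith("#"):
            continue
        # Split at the first colon only, so values such as hdfs:// paths survive.
        key, sep, value = line.partition(":")
        if sep and key.strip().startswith(prefixes):
            settings[key.strip()] = value.strip()
    return settings


# Hypothetical sample, not the configuration from this thread.
sample = """
# example only
jobmanager.memory.process.size: 1600m
state.backend: rocksdb
state.backend.incremental: true
execution.checkpointing.interval: 60s
"""

for key, value in checkpoint_state_settings(sample).items():
    print(f"{key}: {value}")
```

Settings applied in code or in the SQL client would not show up this way, so please mention those separately.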
Best,
Congxian
On Mon, Feb 1, 2021 at 5:41 PM, chen310 <[email protected]> wrote:
> To add, here is the exception from the jobmanager log:
>
> 2021-02-01 08:54:43,639 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:44,642 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:45,644 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:46,647 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:47,649 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:48,652 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:49,655 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:50,658 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:50,921 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] -
> Triggering
> checkpoint 8697 (type=CHECKPOINT) @ 1612169690917 for job
> 1299f2f27e56ec36a4e0ffd3472ad399.
> 2021-02-01 08:54:50,999 INFO
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator [] - Decline
> checkpoint 8697 by task 320d2c162f17265435777bb65e1a8934 of job
> 1299f2f27e56ec36a4e0ffd3472ad399 at
> container_e21_1596002540781_1159_01_000134 @
> ip-10-120-83-22.ap-northeast-1.compute.internal (dataPort=42984).
> 2021-02-01 08:54:51,661 ERROR
> org.apache.flink.runtime.rest.handler.job.JobDetailsHandler [] - Exception
> occurred in REST handler: Job 65892aaedb8064e5743f04b54b5380df not found
> 2021-02-01 08:54:52,654 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
> GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime,
> 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime],
> select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$)
> AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) ->
> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd
> HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1)
> (6beee54a923323c369b046e199f572c4) switched from RUNNING to FAILED on
> org.apache.flink.runtime.jobmaster.slotpool.SingleLogicalSlot@379a8f9c.
> java.io.IOException: Could not perform checkpoint 8697 for operator
> GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime,
> 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime],
> select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$)
> AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) ->
> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd
> HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1).
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:897)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.streaming.runtime.io.CheckpointBarrierHandler.notifyCheckpoint(CheckpointBarrierHandler.java:113)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.streaming.runtime.io.CheckpointBarrierAligner.processBarrier(CheckpointBarrierAligner.java:137)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.streaming.runtime.io.CheckpointedInputGate.pollNext(CheckpointedInputGate.java:93)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.streaming.runtime.io.StreamTaskNetworkInput.emitNext(StreamTaskNetworkInput.java:158)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
> org.apache.flink.streaming.runtime.io.StreamOneInputProcessor.processInput(StreamOneInputProcessor.java:67)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.processInput(StreamTask.java:351)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxStep(MailboxProcessor.java:191)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:181)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:567)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:536)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:721)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at org.apache.flink.runtime.taskmanager.Task.run(Task.java:546)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_181]
> Caused by: org.apache.flink.runtime.checkpoint.CheckpointException: Could
> not complete snapshot 8697 for operator
> GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime,
> 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime],
> select=[COUNT(DISTINCT $f1) AS totalCount, start('w$) AS w$start, end('w$)
> AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS w$proctime]) ->
> Calc(select=[(UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd
> HH:mm:ss')) * 1000) AS requestTime, totalCount]) (1/1). Failure reason:
> Checkpoint was declined.
> at
>
> org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:215)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:156)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:314)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:614)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:540)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:507)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:266)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:926)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:916)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:884)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> ... 13 more
> Caused by: org.apache.flink.util.SerializedThrowable: While open a file for
> appending:
>
> /server/yarn/nm/usercache/yarn/appcache/application_1596002540781_1159/flink-io-1ad6bdc6-aea8-4dc5-a133-7c7b5e2361fe/job_1299f2f27e56ec36a4e0ffd3472ad399_op_AggregateWindowOperator_fa157648fdadffa65122f5b4200f4fda__1_1__uuid_9744ef17-bf12-471c-b486-19140201517f/db/038968.sst:
> Too many open files
> at org.rocksdb.Checkpoint.createCheckpoint(Native Method)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at org.rocksdb.Checkpoint.createCheckpoint(Checkpoint.java:51)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.takeDBNativeCheckpoint(RocksIncrementalSnapshotStrategy.java:255)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.contrib.streaming.state.snapshot.RocksIncrementalSnapshotStrategy.doSnapshot(RocksIncrementalSnapshotStrategy.java:159)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.contrib.streaming.state.snapshot.RocksDBSnapshotStrategyBase.snapshot(RocksDBSnapshotStrategyBase.java:126)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.contrib.streaming.state.RocksDBKeyedStateBackend.snapshot(RocksDBKeyedStateBackend.java:459)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:198)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.api.operators.StreamOperatorStateHandler.snapshotState(StreamOperatorStateHandler.java:156)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.api.operators.AbstractStreamOperator.snapshotState(AbstractStreamOperator.java:314)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointStreamOperator(SubtaskCheckpointCoordinatorImpl.java:614)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.buildOperatorSnapshotFutures(SubtaskCheckpointCoordinatorImpl.java:540)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.takeSnapshotSync(SubtaskCheckpointCoordinatorImpl.java:507)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.SubtaskCheckpointCoordinatorImpl.checkpointState(SubtaskCheckpointCoordinatorImpl.java:266)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.lambda$performCheckpoint$8(StreamTask.java:926)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:47)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:916)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> at
>
> org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpointOnBarrier(StreamTask.java:884)
> ~[flink-dist_2.11-1.12-SNAPSHOT.jar:1.12-SNAPSHOT]
> ... 13 more
> 2021-02-01 08:54:52,654 INFO
>
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - Calculating tasks to restart to recover the failed task
> fa157648fdadffa65122f5b4200f4fda_0.
> 2021-02-01 08:54:52,654 INFO
>
> org.apache.flink.runtime.executiongraph.failover.flip1.RestartPipelinedRegionFailoverStrategy
> [] - 7 tasks should be restarted to recover the failed task
> fa157648fdadffa65122f5b4200f4fda_0.
> 2021-02-01 08:54:52,654 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] - Job
>
> insert-into_default_catalog.default_database.risk_final_accept_sink,default_catalog.default_database.risk_final_accept_grafana_sink
> (1299f2f27e56ec36a4e0ffd3472ad399) switched from state RUNNING to
> RESTARTING.
> 2021-02-01 08:54:52,654 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph [] -
> GroupWindowAggregate(window=[SlidingGroupWindow('w$, requestDateTime,
> 1800000, 600000)], properties=[w$start, w$end, w$rowtime, w$proctime],
> select=[COUNT(DISTINCT merchantReferenceCode) AS acceptCount, start('w$) AS
> w$start, end('w$) AS w$end, rowtime('w$) AS w$rowtime, proctime('w$) AS
> w$proctime]) -> Calc(select=[_UTF-16LE'risk_final_accept_hop10min30min' AS
> eventCode, (w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss') AS
> timeStart, (w$end DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss') AS timeEnd,
> (UNIX_TIMESTAMP((w$start DATE_FORMAT _UTF-16LE'yyyy-MM-dd HH:mm:ss')) *
> 1000) AS requestTime, _UTF-16LE'0' AS userId, acceptCount]) (1/1)
> (52f55328f6bf756dd1c63bb0d149e55b) switched from RUNNING to CANCELING.