Haoze Wu created FLINK-31746:
--------------------------------
Summary: Batch workload output completes while the job client fails
Key: FLINK-31746
URL: https://issues.apache.org/jira/browse/FLINK-31746
Project: Flink
Issue Type: Improvement
Affects Versions: 1.14.0
Reporter: Haoze Wu
We are testing Flink 1.14.0 (we know 1.14.0 is no longer supported, so we are also testing Flink 1.17.0 to see whether it has the same issue). We run a batch processing job whose input is a file on disk and whose output is a Kafka topic, which should receive 170 messages when the workload finishes. In the testing, we inject a fault (an IOException) into a taskmanager, and then the batch processing job client fails:
{code:java}
2023-03-26T19:05:48,922 ERROR cli.CliFrontend (CliFrontend.java:handleError(923)) - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 85c9bd56d6dd111f858b4b5a99551c53) {code}
The IOException occurs in `BoundedBlockingSubpartitionDirectTransferReader$FileRegionReader` when it calls `FileChannel.open`. This fault site can be hit multiple times during a single workload.
{code:java}
FileRegionReader(Path filePath) throws IOException {
    this.fileChannel = FileChannel.open(filePath, StandardOpenOption.READ);
    this.headerBuffer = BufferReaderWriterUtil.allocatedHeaderBuffer();
}
{code}
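To reproduce the failure mode outside Flink: `FileChannel.open` throws an `IOException` subclass (e.g. `NoSuchFileException`) when the partition file cannot be opened. A small stand-alone sketch (hypothetical helper, not Flink code) also shows one way the error could carry the file path, so that the message reaching the client identifies the fault site:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class FileRegionOpenDemo {
    // Hypothetical helper: opens the file as FileRegionReader does and,
    // on failure, rethrows with the path attached for better diagnostics.
    static FileChannel openWithContext(Path filePath) throws IOException {
        try {
            return FileChannel.open(filePath, StandardOpenOption.READ);
        } catch (IOException e) {
            throw new IOException("Failed to open subpartition file: " + filePath, e);
        }
    }

    public static void main(String[] args) {
        // A missing path makes FileChannel.open throw NoSuchFileException,
        // which is an IOException, so the wrapper is exercised here.
        try {
            openWithContext(Paths.get("/nonexistent-dir/subpartition.data"));
        } catch (IOException e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```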
The call stack of this fault site:
{code:java}
(org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartitionDirectTransferReader$FileRegionReader,<init>,200),
(org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartitionDirectTransferReader,<init>,74),
(org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition,createReadView,221),
(org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition,createSubpartitionView,205),
(org.apache.flink.runtime.io.network.partition.ResultPartitionManager,createSubpartitionView,76),
(org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel,requestSubpartition,133),
(org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate,internalRequestPartitions,330),
(org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate,requestPartitions,299),
(org.apache.flink.runtime.taskmanager.InputGateWithMetrics,requestPartitions,127),
(org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1,runThrowing,50),
(org.apache.flink.streaming.runtime.tasks.mailbox.Mail,run,90),
(org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor,processMailsNonBlocking,353),
(org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor,processMail,319),
(org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor,runMailboxLoop,201),
(org.apache.flink.streaming.runtime.tasks.StreamTask,runMailboxLoop,809),
(org.apache.flink.streaming.runtime.tasks.StreamTask,invoke,761),
(org.apache.flink.runtime.taskmanager.Task,runWithSystemExitMonitoring,958),
(org.apache.flink.runtime.taskmanager.Task,restoreAndInvoke,937),
(org.apache.flink.runtime.taskmanager.Task,doRun,766),
(org.apache.flink.runtime.taskmanager.Task,run,575),
(java.lang.Thread,run,748) {code}
Inspecting the names of the threads where the fault occurs, we find that our workload is divided into these tasks:
Split Reader: Custom File Source -> Flat Map (1/8)#0
...
Split Reader: Custom File Source -> Flat Map (8/8)#0
Keyed Aggregation -> Map -> Sink Unnamed Writer (1/8)#0
...
Keyed Aggregation -> Map -> Sink Unnamed Writer (8/8)#0
Sink Unnamed Committer (1/1)#0
A fault during a “Split Reader” or “Keyed Aggregation” task triggers this “Job failed” message, and our Kafka topic does not receive the complete correct output (i.e., fewer than 170 messages). However, if the exception happens during “Sink Unnamed Committer”, the client still reports “Job failed”, even though our Kafka topic has already received everything it should.
We assume that our workload is translated into a few stages: “Custom File Source -> Flat Map”, “Keyed Aggregation -> Map -> Sink Unnamed Writer”, and “Sink Unnamed Committer”. The last one appears to be responsible only for some “commit” work, since it does not affect our end-to-end results. However, a fault in this “commit” stage still reports a “failure” to the job client, which may confuse the client.
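The observation above — that a fault in the final “commit” stage leaves the output already complete — can be illustrated with a stand-alone sketch (plain Java, not Flink code; the stage names are borrowed from our workload and the list stands in for the Kafka topic):

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class CommitStageDemo {
    // "Keyed Aggregation -> Map -> Sink Unnamed Writer": emits all n
    // messages to the stand-in topic before the commit stage runs.
    static List<String> writerStage(int n) {
        List<String> topic = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            topic.add("message-" + i);
        }
        return topic;
    }

    // "Sink Unnamed Committer": the injected fault fires here.
    static void committerStage() throws IOException {
        throw new IOException("injected fault in commit stage");
    }

    public static void main(String[] args) {
        List<String> topic = writerStage(170);
        try {
            committerStage();
        } catch (IOException e) {
            // The client still sees a job failure ...
            System.out.println("client sees: Job failed (" + e.getMessage() + ")");
        }
        // ... but the topic already holds the complete 170-message output.
        System.out.println("messages in topic: " + topic.size());
    }
}
```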
We have some questions about the design rationale:
# In some workloads such as ours, the final “commit” seems not to matter that much. Can a failure there be considered tolerable?
# The client log is confusing. It shows tons of exceptions, but it does not show in which stage of the workload the failure happened. The most useful information for the client would be something like “Sink Unnamed Committer (1/1)#0 (7b19f0a2f247b8f38fe9141c9872ef58) switched from RUNNING to FAILED”, which is not shown.
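As a sketch of how the client could surface something more useful (this is not existing Flink behavior): walk the exception’s cause chain down to the root cause and report that, instead of only the top-level ProgramInvocationException. Applied to the chain in the log below, this would end at the java.io.IOException from FileRegionReader:

```java
import java.io.IOException;

public class RootCauseDemo {
    // Walks getCause() to the end of the chain (guarding against cycles).
    static Throwable rootCause(Throwable t) {
        Throwable cur = t;
        while (cur.getCause() != null && cur.getCause() != cur) {
            cur = cur.getCause();
        }
        return cur;
    }

    public static void main(String[] args) {
        // A chain shaped like the client log (exception types simplified).
        Throwable chain = new RuntimeException("Job failed (JobID: ...)",
                new RuntimeException("Job execution failed.",
                        new RuntimeException("Recovery is suppressed by NoRestartBackoffTimeStrategy",
                                new IOException("fault in FileRegionReader"))));
        System.out.println("root cause: " + rootCause(chain));
    }
}
```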
P.S. The complete failure log of the job client is:
{code:java}
2023-04-03T11:36:25,464 ERROR cli.CliFrontend (CliFrontend.java:handleError(923)) - Error while running the command.
org.apache.flink.client.program.ProgramInvocationException: The main method caused an error: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8a169709de74948b5a9fed7d52c13f8d)
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:372) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.program.PackagedProgram.invokeInteractiveModeForExecution(PackagedProgram.java:222) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.ClientUtils.executeProgram(ClientUtils.java:114) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.cli.CliFrontend.executeProgram(CliFrontend.java:812) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:246) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.cli.CliFrontend.parseAndRun(CliFrontend.java:1054) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.cli.CliFrontend.lambda$main$10(CliFrontend.java:1132) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.security.contexts.NoOpSecurityContext.runSecured(NoOpSecurityContext.java:28) [flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1132) [flink-dist_2.11-1.14.0.jar:1.14.0]
Caused by: java.util.concurrent.ExecutionException: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8a169709de74948b5a9fed7d52c13f8d)
    at java.util.concurrent.CompletableFuture.reportGet(CompletableFuture.java:357) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1895) ~[?:1.8.0_221]
    at org.apache.flink.client.program.StreamContextEnvironment.getJobExecutionResult(StreamContextEnvironment.java:123) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.program.StreamContextEnvironment.execute(StreamContextEnvironment.java:80) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.streaming.api.environment.StreamExecutionEnvironment.execute(StreamExecutionEnvironment.java:1917) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at edu.jhu.order.mcgray.fl_1_14_0.FlinkGrayBatchClientMain.run(FlinkGrayBatchClientMain.java:69) ~[?:?]
    at edu.jhu.order.mcgray.fl_1_14_0.FlinkGrayClientMain.run(FlinkGrayClientMain.java:66) ~[?:?]
    at edu.jhu.order.mcgray.fl_1_14_0.FlinkGrayClientMain.main(FlinkGrayClientMain.java:92) ~[?:?]
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:1.8.0_221]
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:1.8.0_221]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_221]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_221]
    at org.apache.flink.client.program.PackagedProgram.callMainMethod(PackagedProgram.java:355) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    ... 8 more
Caused by: org.apache.flink.client.program.ProgramInvocationException: Job failed (JobID: 8a169709de74948b5a9fed7d52c13f8d)
    at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:125) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_221]
    at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_221]
    at org.apache.flink.client.program.rest.RestClusterClient.lambda$pollResourceAsync$26(RestClusterClient.java:698) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_221]
    at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[?:1.8.0_221]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_221]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_221]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_221]
Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
    at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:144) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.client.deployment.ClusterClientJobClientAdapter.lambda$null$6(ClusterClientJobClientAdapter.java:123) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.util.concurrent.CompletableFuture.uniApply(CompletableFuture.java:602) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniApply.tryFire(CompletableFuture.java:577) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_221]
    at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_221]
    at org.apache.flink.client.program.rest.RestClusterClient.lambda$pollResourceAsync$26(RestClusterClient.java:698) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:1962) ~[?:1.8.0_221]
    at org.apache.flink.util.concurrent.FutureUtils.lambda$retryOperationWithDelay$9(FutureUtils.java:403) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:760) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:736) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:474) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture.postFire(CompletableFuture.java:561) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:929) ~[?:1.8.0_221]
    at java.util.concurrent.CompletableFuture$Completion.run(CompletableFuture.java:442) ~[?:1.8.0_221]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_221]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_221]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_221]
Caused by: org.apache.flink.runtime.JobException: Recovery is suppressed by NoRestartBackoffTimeStrategy
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.handleFailure(ExecutionFailureHandler.java:138) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.executiongraph.failover.flip1.ExecutionFailureHandler.getFailureHandlingResult(ExecutionFailureHandler.java:82) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.handleTaskFailure(DefaultScheduler.java:228) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.maybeHandleTaskFailure(DefaultScheduler.java:218) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.scheduler.DefaultScheduler.updateTaskExecutionStateInternal(DefaultScheduler.java:209) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.scheduler.SchedulerBase.updateTaskExecutionState(SchedulerBase.java:679) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.scheduler.SchedulerNG.updateTaskExecutionState(SchedulerNG.java:79) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.jobmaster.JobMaster.updateTaskExecutionState(JobMaster.java:444) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at sun.reflect.GeneratedMethodAccessor14.invoke(Unknown Source) ~[?:?]
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:1.8.0_221]
    at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_221]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.lambda$handleRpcInvocation$1(AkkaRpcActor.java:316) ~[?:?]
    at org.apache.flink.runtime.concurrent.akka.ClassLoadingUtils.runWithContextClassLoader(ClassLoadingUtils.java:83) ~[?:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcInvocation(AkkaRpcActor.java:314) ~[?:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleRpcMessage(AkkaRpcActor.java:217) ~[?:?]
    at org.apache.flink.runtime.rpc.akka.FencedAkkaRpcActor.handleRpcMessage(FencedAkkaRpcActor.java:78) ~[?:?]
    at org.apache.flink.runtime.rpc.akka.AkkaRpcActor.handleMessage(AkkaRpcActor.java:163) ~[?:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:24) ~[?:?]
    at akka.japi.pf.UnitCaseStatement.apply(CaseStatements.scala:20) ~[?:?]
    at scala.PartialFunction.applyOrElse(PartialFunction.scala:123) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at scala.PartialFunction.applyOrElse$(PartialFunction.scala:122) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at akka.japi.pf.UnitCaseStatement.applyOrElse(CaseStatements.scala:20) ~[?:?]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:171) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at scala.PartialFunction$OrElse.applyOrElse(PartialFunction.scala:172) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at akka.actor.Actor.aroundReceive(Actor.scala:537) ~[?:?]
    at akka.actor.Actor.aroundReceive$(Actor.scala:535) ~[?:?]
    at akka.actor.AbstractActor.aroundReceive(AbstractActor.scala:220) ~[?:?]
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:580) ~[?:?]
    at akka.actor.ActorCell.invoke(ActorCell.scala:548) ~[?:?]
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:270) ~[?:?]
    at akka.dispatch.Mailbox.run(Mailbox.scala:231) ~[?:?]
    at akka.dispatch.Mailbox.exec(Mailbox.scala:243) ~[?:?]
    at java.util.concurrent.ForkJoinTask.doExec(ForkJoinTask.java:289) ~[?:1.8.0_221]
    at java.util.concurrent.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1056) ~[?:1.8.0_221]
    at java.util.concurrent.ForkJoinPool.runWorker(ForkJoinPool.java:1692) ~[?:1.8.0_221]
    at java.util.concurrent.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:157) ~[?:1.8.0_221]
Caused by: java.io.IOException
    at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartitionDirectTransferReader$FileRegionReader.<init>(BoundedBlockingSubpartitionDirectTransferReader.java:229) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartitionDirectTransferReader.<init>(BoundedBlockingSubpartitionDirectTransferReader.java:82) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.io.network.partition.BoundedBlockingSubpartition.createReadView(BoundedBlockingSubpartition.java:226) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.io.network.partition.BufferWritingResultPartition.createSubpartitionView(BufferWritingResultPartition.java:209) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.io.network.partition.ResultPartitionManager.createSubpartitionView(ResultPartitionManager.java:76) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.io.network.partition.consumer.LocalInputChannel.requestSubpartition(LocalInputChannel.java:133) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.internalRequestPartitions(SingleInputGate.java:330) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.io.network.partition.consumer.SingleInputGate.requestPartitions(SingleInputGate.java:299) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.taskmanager.InputGateWithMetrics.requestPartitions(InputGateWithMetrics.java:127) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTaskActionExecutor$1.runThrowing(StreamTaskActionExecutor.java:50) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.Mail.run(Mail.java:90) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMailsNonBlocking(MailboxProcessor.java:358) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.processMail(MailboxProcessor.java:322) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.streaming.runtime.tasks.mailbox.MailboxProcessor.runMailboxLoop(MailboxProcessor.java:202) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.runMailboxLoop(StreamTask.java:809) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:761) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.taskmanager.Task.runWithSystemExitMonitoring(Task.java:958) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.taskmanager.Task.restoreAndInvoke(Task.java:937) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:766) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at org.apache.flink.runtime.taskmanager.Task.run(Task.java:575) ~[flink-dist_2.11-1.14.0.jar:1.14.0]
    at java.lang.Thread.run(Thread.java:748) ~[?:1.8.0_221] {code}
We feel that the job client should probably improve its logging by adding more details about the failure, such as the information about “Sink Unnamed Committer”. We are also checking Flink 1.17.0 to see whether it has this issue.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)