edit: previously, after the cancellation, we had a longer call chain to #jobReachedGloballyTerminalState, which archives the job & gracefully shuts down the JM; that might take some time, so that ...
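To make the suspected ordering concrete, here is a tiny self-contained toy (plain JDK only; class and variable names such as "restServerEventLoop" and "shutDownFuture" are made up, this is not the actual MiniDispatcher / RestServerEndpoint code) showing how completing the shutdown future inside the cancel call chain can tear down the event loop that should serve the response:

import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.RejectedExecutionException;

// Toy illustration only -- not Flink's actual code path.
public class CancelShutdownRaceSketch {

    public static void main(String[] args) {
        // Stand-in for the Netty event loop that serves REST responses.
        ExecutorService restServerEventLoop = Executors.newSingleThreadExecutor();

        // Stand-in for the cluster shutdown future: completing it tears the
        // "REST server" down.
        CompletableFuture<Void> shutDownFuture = new CompletableFuture<>();
        shutDownFuture.thenRun(restServerEventLoop::shutdownNow);

        // Stand-in for the cancel call: the job is cancelled, and -- as with
        // FLINK-15116 -- the shutdown future is completed in the same call
        // chain, before the response has been handed to the event loop.
        CompletableFuture<String> cancelFuture = CompletableFuture.completedFuture("CANCELED");
        cancelFuture.thenRun(() -> shutDownFuture.complete(null));

        // By the time the handler tries to write the response, the event loop
        // is already gone, which surfaces as the
        // "RejectedExecutionException: event executor terminated" Weike saw.
        cancelFuture.thenRun(() -> {
            try {
                restServerEventLoop.submit(() -> System.out.println("response delivered"));
            } catch (RejectedExecutionException e) {
                System.out.println("response lost: " + e);
            }
        });
    }
}

In this toy the bad ordering is forced; in the real cluster it is a race between flushing the response and shutting down the RestServer, which would explain why the behavior is only almost always reproducible.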
Best,
tison.


tison <wander4...@gmail.com> 于2020年3月17日周二 上午10:13写道:

> Hi Weike & Till,
>
> I agree with Till and it is also the analysis from my side. However, it
> seems that even without FLINK-15116, it would still be possible that we
> complete the cancel future but the cluster gets shut down before it has
> properly delivered the response.
>
> One strange thing is that this behavior is almost always reproducible; it
> should only be one possible ordering, not the one that always happens.
> Maybe previously we first had to cancel the job, which has a long call
> chain, so that we happened to have enough time to deliver the response.
>
> But the resolution looks like introducing some synchronization/finalization
> logic that clears these outstanding futures with best effort before the
> cluster (RestServer) goes down.
>
> Best,
> tison.
>
>
> Till Rohrmann <trohrm...@apache.org> 于2020年3月17日周二 上午4:12写道:
>
>> Hi Weike,
>>
>> could you share the complete logs with us? Attachments are being filtered
>> out by the Apache mail server, but it works if you upload the logs
>> somewhere (e.g. https://gist.github.com/) and then share the link with us.
>> Ideally you run the cluster with DEBUG log settings.
>>
>> I assume that you are running Flink 1.10, right?
>>
>> My suspicion is that this behaviour has been introduced with FLINK-15116
>> [1]. It looks as if we complete the shutdown future in
>> MiniDispatcher#cancelJob before we return the response to the
>> RestClusterClient. My guess is that this triggers the shutdown of the
>> RestServer, which then is not able to serve the response to the client. I'm
>> pulling in Aljoscha and Tison who introduced this change. They might be
>> able to verify my theory and propose a solution for it.
>>
>> [1] https://issues.apache.org/jira/browse/FLINK-15116
>>
>> Cheers,
>> Till
>>
>> On Fri, Mar 13, 2020 at 7:50 AM DONG, Weike <kyled...@connect.hku.hk>
>> wrote:
>>
>>> Hi Yangze and all,
>>>
>>> I have tried numerous times, and this behavior persists.
>>>
>>> Below is the tail log of taskmanager.log:
>>>
>>> 2020-03-13 12:06:14.240 [flink-akka.actor.default-dispatcher-3] INFO
>>> org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot
>>> TaskSlot(index:0, state:ACTIVE, resource profile:
>>> ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.503gb
>>> (1613968148 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.403gb
>>> (1505922928 bytes), networkMemory=359.040mb (376480732 bytes)},
>>> allocationId: d3acaeac3db62454742e800b5410adfd, jobId:
>>> d0a674795be98bd2574d9ea3286801cb).
>>> 2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO
>>> org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job
>>> d0a674795be98bd2574d9ea3286801cb from job leader monitoring.
>>> 2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager
>>> connection for job d0a674795be98bd2574d9ea3286801cb.
>>> 2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO
>>> org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager
>>> connection for job d0a674795be98bd2574d9ea3286801cb.
>>> 2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO
>>> org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect
>>> to job d0a674795be98bd2574d9ea3286801cb because it is not registered.
>>> 2020-03-13 12:06:19.744 [SIGTERM handler] INFO
>>> org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15:
>>> SIGTERM. Shutting down as requested.
>>> 2020-03-13 12:06:19.745 [PermanentBlobCache shutdown hook] INFO
>>> org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
>>> 2020-03-13 12:06:19.749 [FileChannelManagerImpl-netty-shuffle shutdown hook] INFO
>>> org.apache.flink.runtime.io.disk.FileChannelManagerImpl -
>>> FileChannelManager removed spill file directory
>>> /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-netty-shuffle-65cd4ebb-51f4-48a9-8e3c-43e431bca46d
>>> 2020-03-13 12:06:19.750 [TaskExecutorLocalStateStoresManager shutdown hook] INFO
>>> org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager -
>>> Shutting down TaskExecutorLocalStateStoresManager.
>>> 2020-03-13 12:06:19.750 [TransientBlobCache shutdown hook] INFO
>>> org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
>>> 2020-03-13 12:06:19.751 [FileChannelManagerImpl-io shutdown hook] INFO
>>> org.apache.flink.runtime.io.disk.FileChannelManagerImpl -
>>> FileChannelManager removed spill file directory
>>> /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-io-67ad5c3a-aec6-42be-ab1f-0ce3841fc4bd
>>> 2020-03-13 12:06:19.752 [FileCache shutdown hook] INFO
>>> org.apache.flink.runtime.filecache.FileCache - removed file cache directory
>>> /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-dist-cache-65075ee3-e009-4978-a9d8-ec010e6f4b31
>>>
>>> As the tail log of jobmanager.log is kind of lengthy, I have attached it
>>> in this mail.
>>>
>>> From what I have seen, the TaskManager and JobManager shut down by
>>> themselves; however, I have noticed some Netty exceptions (from the stack
>>> trace, they come from the REST handler) like:
>>>
>>> ERROR
>>> org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution
>>> - Failed to submit a listener notification task. Event loop shut down?
>>> java.util.concurrent.RejectedExecutionException: event executor terminated
>>>
>>> Thus I suppose that these exceptions might be the actual cause of the
>>> premature termination of the REST server, and I am still looking into the
>>> real cause of this.
>>>
>>> Best,
>>> Weike
>>>
>>> On Fri, Mar 13, 2020 at 1:45 PM Yangze Guo <karma...@gmail.com> wrote:
>>>
>>>> Would you mind sharing more information about why the task executor
>>>> is killed? If it is killed by Yarn, you might get such info in Yarn
>>>> NM/RM logs.
>>>>
>>>> Best,
>>>> Yangze Guo
>>>>
>>>>
>>>> On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike <kyled...@connect.hku.hk>
>>>> wrote:
>>>> >
>>>> > Hi,
>>>> >
>>>> > Recently I have encountered a strange behavior of Flink on YARN:
>>>> when I try to cancel a Flink job running in per-job mode on YARN using
>>>> a command like
>>>> >
>>>> > "cancel -m yarn-cluster -yid application_1559388106022_9412
>>>> ed7e2e0ab0a7316c1b65df6047bc6aae"
>>>> >
>>>> > the client happily finds and connects to the ResourceManager and then
>>>> gets stuck at
>>>> > Found Web Interface 172.28.28.3:50099 of application
>>>> 'application_1559388106022_9412'.
>>>> >
>>>> > And after one minute, an exception is thrown at the client side:
>>>> > Caused by: org.apache.flink.util.FlinkException: Could not cancel job
>>>> ed7e2e0ab0a7316c1b65df6047bc6aae.
>>>> > at
>>>> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>>>> > at
>>>> org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>>>> > at
>>>> org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>>>> > at
>>>> org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
>>>> > at
>>>> org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
>>>> > at java.security.AccessController.doPrivileged(Native Method)
>>>> > at javax.security.auth.Subject.doAs(Subject.java:422)
>>>> > at
>>>> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>>>> > ... 20 more
>>>> > Caused by: java.util.concurrent.TimeoutException
>>>> > at
>>>> java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>>> > at
>>>> java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>>> > at
>>>> org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
>>>> > ... 27 more
>>>> >
>>>> > Then I discovered that the YARN app had already terminated with the
>>>> FINISHED state and the KILLED final status, like below.
>>>> >
>>>> > And after digging into the log of this finished YARN app, I found
>>>> that the TaskManager had already received the SIGTERM signal and
>>>> terminated gracefully.
>>>> > org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15:
>>>> SIGTERM. Shutting down as requested.
>>>> >
>>>> > Also, the log of the JobManager shows that it terminated with exit code 0.
>>>> > Terminating cluster entrypoint process YarnJobClusterEntrypoint with
>>>> exit code 0
>>>> >
>>>> > However, the JobManager did not return anything to the client before
>>>> its shutdown, which is different from previous versions (like Flink 1.9).
>>>> >
>>>> > I wonder if this is a new bug in the flink-clients or flink-yarn
>>>> module?
>>>> >
>>>> > Thank you : )
>>>> >
>>>> > Sincerely,
>>>> > Weike
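For the "best effort" synchronization/finalization idea in the quoted mail above, a rough sketch of one possible direction (plain JDK, hypothetical class and method names; not Flink's actual RestServerEndpoint API) would be to track in-flight responses and wait for them, bounded by a timeout, before the shutdown future actually takes the REST server down:

import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Sketch only -- names are hypothetical.
public class InFlightResponseDrainer {

    // Responses that have been produced but may not have reached the client yet.
    private final Set<CompletableFuture<Void>> inFlightResponses =
            ConcurrentHashMap.newKeySet();

    // A handler would call this with a future that completes once its
    // response has been flushed to the client.
    public CompletableFuture<Void> track(CompletableFuture<Void> responseFlushed) {
        inFlightResponses.add(responseFlushed);
        responseFlushed.whenComplete(
                (ignored, error) -> inFlightResponses.remove(responseFlushed));
        return responseFlushed;
    }

    // Called before the REST server is actually shut down: give outstanding
    // responses a bounded amount of time to go out, then proceed regardless.
    public void drainBestEffort(long timeoutMillis) {
        CompletableFuture<Void> allFlushed = CompletableFuture.allOf(
                inFlightResponses.toArray(new CompletableFuture<?>[0]));
        try {
            allFlushed.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } catch (TimeoutException | ExecutionException e) {
            // Best effort only: do not block the cluster shutdown forever.
        }
    }
}

Such draining would have to run before the completion of the shutdown future actually closes the RestServer; where exactly to hook that into the shutdown path is the open question.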