JIRA created as https://jira.apache.org/jira/browse/FLINK-16637
Best,
tison.

Till Rohrmann <trohrm...@apache.org> wrote on Tue, Mar 17, 2020 at 5:57 PM:

> @Tison could you create an issue to track the problem? Please also link the
> uploaded log file for further debugging.
>
> I think the reason why it worked in Flink 1.9 could be that we had an async
> callback in the longer call chain which broke the flow of execution and
> allowed the response to be sent. This is no longer the case. As an easy fix
> one could change thenAccept into thenAcceptAsync in MiniDispatcher#cancelJob.
> But this only treats the symptoms. Maybe we should think about allowing
> handlers other than StatusHandler to close asynchronously as well. At the
> moment all other handlers shut down immediately (see
> AbstractHandler#closeHandlerAsync). The problem with that change would be
> that every handler would become stateful, because it would need to remember
> whether a request is currently ongoing or not.
>
> Cheers,
> Till
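For readers following along, below is a minimal, self-contained sketch of the difference Till refers to. The class and field names are invented for illustration; this is not the actual MiniDispatcher code, only the thenAccept vs. thenAcceptAsync pattern:

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.Executor;
    import java.util.concurrent.Executors;

    // Hypothetical stand-in for the dispatcher; not Flink's MiniDispatcher.
    class CancelJobSketch {

        private final CompletableFuture<Void> shutdownFuture = new CompletableFuture<>();
        private final Executor shutdownExecutor = Executors.newSingleThreadExecutor();

        // Variant 1: the callback completing the shutdown future runs inline on
        // the thread that completes the cancel future, so cluster shutdown can
        // start before the REST response has been written out.
        CompletableFuture<String> cancelJobSync(CompletableFuture<String> cancelFuture) {
            cancelFuture.thenAccept(ack -> shutdownFuture.complete(null));
            return cancelFuture; // the REST response is built from this future
        }

        // Variant 2 (the "easy fix"): the callback is handed off to another
        // executor, giving the RestServer a chance to flush the response first.
        // As Till notes, this only treats the symptom; the shutdown/response
        // race still exists.
        CompletableFuture<String> cancelJobAsync(CompletableFuture<String> cancelFuture) {
            cancelFuture.thenAcceptAsync(ack -> shutdownFuture.complete(null), shutdownExecutor);
            return cancelFuture;
        }
    }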
> On Tue, Mar 17, 2020 at 9:01 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>
>> Hi Tison & Till and all,
>>
>> I have uploaded the client, taskmanager and jobmanager logs to Gist
>> (https://gist.github.com/kylemeow/500b6567368316ec6f5b8f99b469a49f), and I
>> can reproduce this bug every time when trying to cancel Flink 1.10 jobs on
>> YARN.
>>
>> Besides, in earlier Flink versions like 1.9, the REST API for *cancelling a
>> job with a savepoint* sometimes threw exceptions to the client side due to
>> early shutdown of the server, even though the log shows the savepoint
>> completed successfully. With the newly introduced *stop* API that bug
>> disappeared; now, however, the *cancel* API seems to be the buggy one.
>>
>> Best,
>> Weike
>>
>> On Tue, Mar 17, 2020 at 10:17 AM tison <wander4...@gmail.com> wrote:
>>
>>> edit: previously, after the cancellation we had a longer call chain to
>>> #jobReachedGloballyTerminalState, which does the job archiving & graceful
>>> JM shutdown. That might take some time, so that ...
>>>
>>> Best,
>>> tison.
>>>
>>> tison <wander4...@gmail.com> wrote on Tue, Mar 17, 2020 at 10:13 AM:
>>>
>>>> Hi Weike & Till,
>>>>
>>>> I agree with Till; that is also the analysis from my side. However, it
>>>> seems that even without FLINK-15116 it is still possible that we complete
>>>> the cancel future but the cluster gets shut down before it has properly
>>>> delivered the response.
>>>>
>>>> One strange thing is that this behavior is almost always reproducible; it
>>>> should be one possible ordering, but not the only one. Maybe previously we
>>>> first had to cancel the job through a long call chain, so there happened
>>>> to be enough time to deliver the response.
>>>>
>>>> The resolution, then, looks like introducing some synchronization/
>>>> finalization logic that clears these outstanding futures with best effort
>>>> before the cluster (RestServer) goes down.
>>>>
>>>> Best,
>>>> tison.
>>>>
>>>> Till Rohrmann <trohrm...@apache.org> wrote on Tue, Mar 17, 2020 at 4:12 AM:
>>>>
>>>>> Hi Weike,
>>>>>
>>>>> could you share the complete logs with us? Attachments are filtered out
>>>>> by the Apache mail server, but it works if you upload the logs somewhere
>>>>> (e.g. https://gist.github.com/) and then share the link with us. Ideally
>>>>> you run the cluster with DEBUG log settings.
>>>>>
>>>>> I assume that you are running Flink 1.10, right?
>>>>>
>>>>> My suspicion is that this behaviour has been introduced with FLINK-15116
>>>>> [1]. It looks as if we complete the shutdown future in
>>>>> MiniDispatcher#cancelJob before we return the response to the
>>>>> RestClusterClient. My guess is that this triggers the shutdown of the
>>>>> RestServer, which then is not able to serve the response to the client.
>>>>> I'm pulling in Aljoscha and Tison who introduced this change. They might
>>>>> be able to verify my theory and propose a solution for it.
>>>>>
>>>>> [1] https://issues.apache.org/jira/browse/FLINK-15116
>>>>>
>>>>> Cheers,
>>>>> Till
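As an illustration of the best-effort finalization tison sketches above, the following is a rough, hypothetical example (the class and method names are invented and do not correspond to Flink's RestServerEndpoint): track in-flight response futures and wait a bounded time for them before actually closing the server.

    import java.util.Set;
    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    // Hypothetical sketch, not Flink's actual REST server shutdown logic.
    class GracefulRestShutdownSketch {

        private final Set<CompletableFuture<?>> inFlightResponses = ConcurrentHashMap.newKeySet();

        // Handlers register the future of every response they are still writing.
        void trackResponse(CompletableFuture<?> responseFuture) {
            inFlightResponses.add(responseFuture);
            responseFuture.whenComplete((result, error) -> inFlightResponses.remove(responseFuture));
        }

        // Invoked when the cluster's shutdown future completes: give outstanding
        // responses a bounded grace period before tearing the REST endpoint down.
        void shutDown(Runnable closeServer) {
            CompletableFuture<Void> outstanding =
                    CompletableFuture.allOf(inFlightResponses.toArray(new CompletableFuture<?>[0]));
            try {
                outstanding.get(10, TimeUnit.SECONDS); // best effort, not a guarantee
            } catch (InterruptedException | ExecutionException | TimeoutException e) {
                // proceed with the shutdown regardless; this is only best effort
            }
            closeServer.run();
        }
    }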
>>>>> On Fri, Mar 13, 2020 at 7:50 AM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>
>>>>>> Hi Yangze and all,
>>>>>>
>>>>>> I have tried numerous times, and this behavior persists.
>>>>>>
>>>>>> Below is the tail of taskmanager.log:
>>>>>>
>>>>>> 2020-03-13 12:06:14.240 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.slot.TaskSlotTableImpl - Free slot TaskSlot(index:0, state:ACTIVE, resource profile: ResourceProfile{cpuCores=1.0000000000000000, taskHeapMemory=1.503gb (1613968148 bytes), taskOffHeapMemory=0 bytes, managedMemory=1.403gb (1505922928 bytes), networkMemory=359.040mb (376480732 bytes)}, allocationId: d3acaeac3db62454742e800b5410adfd, jobId: d0a674795be98bd2574d9ea3286801cb).
>>>>>> 2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Remove job d0a674795be98bd2574d9ea3286801cb from job leader monitoring.
>>>>>> 2020-03-13 12:06:14.244 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job d0a674795be98bd2574d9ea3286801cb.
>>>>>> 2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.TaskExecutor - Close JobManager connection for job d0a674795be98bd2574d9ea3286801cb.
>>>>>> 2020-03-13 12:06:14.250 [flink-akka.actor.default-dispatcher-3] INFO org.apache.flink.runtime.taskexecutor.JobLeaderService - Cannot reconnect to job d0a674795be98bd2574d9ea3286801cb because it is not registered.
>>>>>> 2020-03-13 12:06:19.744 [SIGTERM handler] INFO org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>>>> 2020-03-13 12:06:19.744 [SIGTERM handler] INFO org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>>>> 2020-03-13 12:06:19.745 [PermanentBlobCache shutdown hook] INFO org.apache.flink.runtime.blob.PermanentBlobCache - Shutting down BLOB cache
>>>>>> 2020-03-13 12:06:19.749 [FileChannelManagerImpl-netty-shuffle shutdown hook] INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl - FileChannelManager removed spill file directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-netty-shuffle-65cd4ebb-51f4-48a9-8e3c-43e431bca46d
>>>>>> 2020-03-13 12:06:19.750 [TaskExecutorLocalStateStoresManager shutdown hook] INFO org.apache.flink.runtime.state.TaskExecutorLocalStateStoresManager - Shutting down TaskExecutorLocalStateStoresManager.
>>>>>> 2020-03-13 12:06:19.750 [TransientBlobCache shutdown hook] INFO org.apache.flink.runtime.blob.TransientBlobCache - Shutting down BLOB cache
>>>>>> 2020-03-13 12:06:19.751 [FileChannelManagerImpl-io shutdown hook] INFO org.apache.flink.runtime.io.disk.FileChannelManagerImpl - FileChannelManager removed spill file directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-io-67ad5c3a-aec6-42be-ab1f-0ce3841fc4bd
>>>>>> 2020-03-13 12:06:19.752 [FileCache shutdown hook] INFO org.apache.flink.runtime.filecache.FileCache - removed file cache directory /data/emr/yarn/local/usercache/hadoop/appcache/application_1562207369540_0135/flink-dist-cache-65075ee3-e009-4978-a9d8-ec010e6f4b31
>>>>>>
>>>>>> As the tail of jobmanager.log is rather lengthy, I have attached it to this
>>>>>> mail.
>>>>>>
>>>>>> From what I have seen, the TaskManager and JobManager shut down by
>>>>>> themselves; however, I have noticed some Netty exceptions (judging from
>>>>>> the stack trace, inside the REST handler) like:
>>>>>>
>>>>>> ERROR org.apache.flink.shaded.netty4.io.netty.util.concurrent.DefaultPromise.rejectedExecution - Failed to submit a listener notification task. Event loop shut down?
>>>>>> java.util.concurrent.RejectedExecutionException: event executor terminated
>>>>>>
>>>>>> Thus I suppose that these exceptions might be the actual cause of the
>>>>>> premature termination of the REST server, and I am still looking into the
>>>>>> real cause of this.
>>>>>>
>>>>>> Best,
>>>>>> Weike
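The Netty error Weike quotes is what DefaultPromise logs when a listener notification is submitted to an executor that has already terminated. A minimal stand-alone reproduction (a sketch using plain io.netty rather than Flink's shaded copy; class names here are only for illustration) might look like this:

    import io.netty.util.concurrent.DefaultEventExecutorGroup;
    import io.netty.util.concurrent.DefaultPromise;
    import io.netty.util.concurrent.EventExecutor;

    public class RejectedNotificationSketch {
        public static void main(String[] args) {
            DefaultEventExecutorGroup group = new DefaultEventExecutorGroup(1);
            EventExecutor executor = group.next();

            // A promise bound to the executor; listener notifications are
            // scheduled onto that executor.
            DefaultPromise<Void> promise = new DefaultPromise<>(executor);
            promise.addListener(f -> System.out.println("notified, success = " + f.isSuccess()));

            // Shut the executor down before the promise completes ...
            group.shutdownGracefully().syncUninterruptibly();

            // ... so completing it now makes DefaultPromise try to submit the
            // listener notification task to a terminated executor, which logs
            // "Failed to submit a listener notification task. Event loop shut down?"
            promise.trySuccess(null);
        }
    }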
>>>>>> On Fri, Mar 13, 2020 at 1:45 PM Yangze Guo <karma...@gmail.com> wrote:
>>>>>>
>>>>>>> Would you mind sharing more information about why the task executor is
>>>>>>> killed? If it is killed by Yarn, you might find such info in the Yarn
>>>>>>> NM/RM logs.
>>>>>>>
>>>>>>> Best,
>>>>>>> Yangze Guo
>>>>>>>
>>>>>>> On Fri, Mar 13, 2020 at 12:31 PM DONG, Weike <kyled...@connect.hku.hk> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Recently I have encountered a strange behavior of Flink on YARN: when I
>>>>>>>> try to cancel a Flink job running in per-job mode on YARN using a
>>>>>>>> command like
>>>>>>>>
>>>>>>>> "cancel -m yarn-cluster -yid application_1559388106022_9412 ed7e2e0ab0a7316c1b65df6047bc6aae"
>>>>>>>>
>>>>>>>> the client happily finds and connects to the ResourceManager and then
>>>>>>>> gets stuck at
>>>>>>>>
>>>>>>>> Found Web Interface 172.28.28.3:50099 of application 'application_1559388106022_9412'.
>>>>>>>>
>>>>>>>> After one minute, an exception is thrown at the client side:
>>>>>>>>
>>>>>>>> Caused by: org.apache.flink.util.FlinkException: Could not cancel job ed7e2e0ab0a7316c1b65df6047bc6aae.
>>>>>>>>     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:545)
>>>>>>>>     at org.apache.flink.client.cli.CliFrontend.runClusterAction(CliFrontend.java:843)
>>>>>>>>     at org.apache.flink.client.cli.CliFrontend.cancel(CliFrontend.java:538)
>>>>>>>>     at org.apache.flink.client.cli.CliFrontend.parseParametersWithException(CliFrontend.java:917)
>>>>>>>>     at org.apache.flink.client.cli.CliFrontend.lambda$mainWithReturnCodeAndException$10(CliFrontend.java:988)
>>>>>>>>     at java.security.AccessController.doPrivileged(Native Method)
>>>>>>>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>>>>>>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1754)
>>>>>>>>     ... 20 more
>>>>>>>> Caused by: java.util.concurrent.TimeoutException
>>>>>>>>     at java.util.concurrent.CompletableFuture.timedGet(CompletableFuture.java:1771)
>>>>>>>>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1915)
>>>>>>>>     at org.apache.flink.client.cli.CliFrontend.lambda$cancel$7(CliFrontend.java:543)
>>>>>>>>     ... 27 more
>>>>>>>>
>>>>>>>> Then I discovered that the YARN app had already terminated with FINISHED
>>>>>>>> state and KILLED final status, like below.
>>>>>>>>
>>>>>>>> And after digging into the log of this finished YARN app, I found that
>>>>>>>> the TaskManager had already received the SIGTERM signal and terminated
>>>>>>>> gracefully:
>>>>>>>>
>>>>>>>> org.apache.flink.yarn.YarnTaskExecutorRunner - RECEIVED SIGNAL 15: SIGTERM. Shutting down as requested.
>>>>>>>>
>>>>>>>> Also, the log of the JobManager shows that it terminated with exit code 0:
>>>>>>>>
>>>>>>>> Terminating cluster entrypoint process YarnJobClusterEntrypoint with exit code 0
>>>>>>>>
>>>>>>>> However, the JobManager did not return anything to the client before its
>>>>>>>> shutdown, which is different from previous versions (like Flink 1.9).
>>>>>>>>
>>>>>>>> I wonder if this is a new bug in the flink-clients or flink-yarn module?
>>>>>>>>
>>>>>>>> Thank you : )
>>>>>>>>
>>>>>>>> Sincerely,
>>>>>>>> Weike
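For context on the stack trace above: the CLI issues the cancel request and then blocks on the returned future with a fixed client-side timeout, so when the JobManager exits before the REST response arrives, the wait simply times out. A simplified sketch of that pattern (the class and method names and the 60-second value are assumptions chosen to match the observed one-minute wait, not the actual CliFrontend code):

    import java.util.concurrent.CompletableFuture;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    class CancelTimeoutSketch {

        // Simplified stand-in for the CLI's cancel action.
        static void cancelJob(CompletableFuture<Void> cancelResponseFuture) throws Exception {
            try {
                // Block until the cluster acknowledges the cancellation, or the
                // client-side timeout (assumed 60 s here) expires.
                cancelResponseFuture.get(60, TimeUnit.SECONDS);
                System.out.println("Cancelled job.");
            } catch (TimeoutException e) {
                // The path seen in the stack trace: the JobManager shut down
                // before serving the response, so the future never completed.
                throw new Exception("Could not cancel job.", e);
            }
        }
    }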