[ https://issues.apache.org/jira/browse/FLINK-18681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17166878#comment-17166878 ]
Xintong Song commented on FLINK-18681:
--------------------------------------

[~apach...@163.com], thanks for providing the screenshot and logs.

I found the following warnings in the Yarn RM log.
{code:java}
2020-07-22 17:54:57,155 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id. PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_000002
2020-07-22 17:54:58,157 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id. PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_000003
2020-07-22 17:54:59,160 WARN org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wangty IP=x.x.x.61 OPERATION=AM Released Container TARGET=Scheduler RESULT=FAILURE DESCRIPTION=Trying to release container not owned by app or with invalid id. PERMISSIONS=Unauthorized access or invalid container APPID=application_1590424616102_556340 CONTAINERID=container_1590424616102_556340_01_000004
{code}
The log shows that Flink did release the containers, but the Yarn RM rejected the operations.

The API Flink uses to release containers is {{AMRMClientAsync#releaseAssignedContainer}}, called via the same client that successfully allocated the containers from Yarn.
{code:java}
/**
 * Release containers assigned by the Resource Manager. If the app cannot use
 * the container or wants to give up the container then it can release them.
 * The app needs to make new requests for the released resource capability if
 * it still needs it. eg. it released non-local resources
 * @param containerId
 */
public abstract void releaseAssignedContainer(ContainerId containerId);
{code}
It seems to me that the Hadoop API did not work as expected. I would suggest trying to get some help from the Apache Hadoop community.

Pulling in [~Tao Yang], who is an Apache Hadoop committer and an expert in Yarn.

> The jar package version conflict causes the task to continue to increase and grab resources
> -------------------------------------------------------------------------------------------
>
>                 Key: FLINK-18681
>                 URL: https://issues.apache.org/jira/browse/FLINK-18681
>             Project: Flink
>          Issue Type: Bug
>    Affects Versions: 1.11.0
>            Reporter: wangtaiyang
>            Priority: Major
>         Attachments: appId.log, dependency.log, image-2020-07-28-15-32-51-851.png, yarn-hadoop-resourcemanager-x.x.x.15.log.2020-07-22-17.log
>
>
> When I submit a Flink task to Yarn, the default resource configuration is 1G and 1 core, but in fact the task keeps acquiring more resources: 2 cores, 3 cores, and so on ... 200 cores ... I then looked at the JM log and found the following error:
> {code:java}
> // code placeholder
> java.lang.NoSuchMethodError: org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder;
> java.lang.NoSuchMethodError: org.apache.commons.cli.Option.builder(Ljava/lang/String;)Lorg/apache/commons/cli/Option$Builder;
>     at org.apache.flink.runtime.entrypoint.parser.CommandLineOptions.<clinit>(CommandLineOptions.java:28) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) ~[?:1.8.0_191]
> .......
> java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.runtime.entrypoint.parser.CommandLineOptions
> java.lang.NoClassDefFoundError: Could not initialize class org.apache.flink.runtime.entrypoint.parser.CommandLineOptions
>     at org.apache.flink.runtime.clusterframework.BootstrapTools.lambda$getDynamicPropertiesAsString$0(BootstrapTools.java:648) ~[flink-dist_2.11-1.11.1.jar:1.11.1]
>     at java.util.stream.ReferencePipeline$7$1.accept(ReferencePipeline.java:267) ~[?:1.8.0_191]
>     at java.util.HashMap$KeySpliterator.forEachRemaining(HashMap.java:1553) ~[?:1.8.0_191]
>     at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_191]
> {code}
> Finally, I confirmed that the error is caused by a commons-cli version conflict. However, the failing task does not stop; it keeps grabbing more and more resources. Is this a bug?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)