[ https://issues.apache.org/jira/browse/FLINK-10368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Thomas Weise updated FLINK-10368:
---------------------------------
    Fix Version/s:     (was: 1.5.6)

> 'Kerberized YARN on Docker test' unstable
> -----------------------------------------
>
>                 Key: FLINK-10368
>                 URL: https://issues.apache.org/jira/browse/FLINK-10368
>             Project: Flink
>          Issue Type: Bug
>          Components: Tests
>    Affects Versions: 1.5.3, 1.6.0, 1.7.0
>            Reporter: Till Rohrmann
>            Assignee: Dawid Wysakowicz
>            Priority: Critical
>              Labels: pull-request-available, test-stability
>             Fix For: 1.6.4, 1.7.2
>
> Running the 'Kerberized YARN on Docker' end-to-end test failed on an AWS instance. The problem seems to be that the NameNode went into safe mode due to limited resources.
> {code}
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/hadoop-user/flink-1.6.1/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/local/hadoop-2.8.4/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2018-09-19 09:04:39,201 INFO  org.apache.hadoop.security.UserGroupInformation - Login successful for user hadoop-user using keytab file /home/hadoop-user/hadoop-user.keytab
> 2018-09-19 09:04:39,453 INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at master.docker-hadoop-cluster-network/172.22.0.3:8032
> 2018-09-19 09:04:39,640 INFO  org.apache.hadoop.yarn.client.AHSProxy - Connecting to Application History server at master.docker-hadoop-cluster-network/172.22.0.3:10200
> 2018-09-19 09:04:39,656 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2018-09-19 09:04:39,656 INFO  org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2018-09-19 09:04:39,901 INFO  org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=2000, taskManagerMemoryMB=2000, numberTaskManagers=3, slotsPerTaskManager=1}
> 2018-09-19 09:04:40,286 WARN  org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/home/hadoop-user/flink-1.6.1/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
> ------------------------------------------------------------
>  The program finished with the following exception:
> org.apache.flink.client.deployment.ClusterDeploymentException: Couldn't deploy Yarn session cluster
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:420)
>     at org.apache.flink.client.cli.CliFrontend.runProgram(CliFrontend.java:259)
>     at org.apache.flink.client.cli.CliFrontend.run(CliFrontend.java:215)
>     at org.apache.flink.client.cli.CliFrontend.parseParameters(CliFrontend.java:1044)
>     at org.apache.flink.client.cli.CliFrontend.lambda$main$11(CliFrontend.java:1120)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1836)
>     at org.apache.flink.runtime.security.HadoopSecurityContext.runSecured(HadoopSecurityContext.java:41)
>     at org.apache.flink.client.cli.CliFrontend.main(CliFrontend.java:1120)
> Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Cannot create file/user/hadoop-user/.flink/application_1537266361291_0099/lib/slf4j-log4j12-1.7.7.jar. Name node is in safe mode.
> Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
> NamenodeHostName:master.docker-hadoop-cluster-network
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1407)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1395)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2278)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2223)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
>     at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:121)
>     at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:88)
>     at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:270)
>     at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1274)
>     at org.apache.hadoop.hdfs.DFSClient.create(DFSClient.java:1216)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:473)
>     at org.apache.hadoop.hdfs.DistributedFileSystem$8.doCall(DistributedFileSystem.java:470)
>     at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:470)
>     at org.apache.hadoop.hdfs.DistributedFileSystem.create(DistributedFileSystem.java:411)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:929)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:910)
>     at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:807)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:368)
>     at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:341)
>     at org.apache.hadoop.fs.FileSystem.copyFromLocalFile(FileSystem.java:2002)
>     at org.apache.flink.yarn.Utils.setupLocalResource(Utils.java:162)
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.setupSingleLocalResource(AbstractYarnClusterDescriptor.java:1139)
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.access$000(AbstractYarnClusterDescriptor.java:111)
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor$1.visitFile(AbstractYarnClusterDescriptor.java:1200)
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor$1.visitFile(AbstractYarnClusterDescriptor.java:1188)
>     at java.nio.file.Files.walkFileTree(Files.java:2670)
>     at java.nio.file.Files.walkFileTree(Files.java:2742)
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.uploadAndRegisterFiles(AbstractYarnClusterDescriptor.java:1188)
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.startAppMaster(AbstractYarnClusterDescriptor.java:800)
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deployInternal(AbstractYarnClusterDescriptor.java:542)
>     at org.apache.flink.yarn.AbstractYarnClusterDescriptor.deploySessionCluster(AbstractYarnClusterDescriptor.java:413)
>     ... 9 more
> Caused by: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.SafeModeException): Cannot create file/user/hadoop-user/.flink/application_1537266361291_0099/lib/slf4j-log4j12-1.7.7.jar. Name node is in safe mode.
> Resources are low on NN. Please add or free up more resources then turn off safe mode manually. NOTE: If you turn off safe mode before adding resources, the NN will immediately return to safe mode. Use "hdfs dfsadmin -safemode leave" to turn safe mode off.
> NamenodeHostName:master.docker-hadoop-cluster-network
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.newSafemodeException(FSNamesystem.java:1407)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1395)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2278)
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2223)
>     at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.create(NameNodeRpcServer.java:728)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.create(ClientNamenodeProtocolServerSideTranslatorPB.java:413)
>     at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:447)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:989)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:850)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:793)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1840)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2489)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1489)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1435)
>     at org.apache.hadoop.ipc.Client.call(Client.java:1345)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:227)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
>     at com.sun.proxy.$Proxy14.create(Unknown Source)
>     at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.create(ClientNamenodeProtocolTranslatorPB.java:297)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>     at java.lang.reflect.Method.invoke(Method.java:498)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:409)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:163)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:155)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:95)
>     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:346)
>     at com.sun.proxy.$Proxy15.create(Unknown Source)
>     at org.apache.hadoop.hdfs.DFSOutputStream.newStreamForCreate(DFSOutputStream.java:265)
>     ... 33 more
> Running the Flink job failed, might be that the cluster is not ready yet. We have been trying for 795 seconds, retrying ...
> {code}
> I think it would be good to harden the test.
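
One possible way to harden the test, sketched below under stated assumptions (this is not the fix from the linked pull request): have the test script wait for the NameNode to leave safe mode before deploying the YARN session, and free HDFS space if it stays stuck. The {{hdfs dfsadmin -safemode}} subcommands are standard HDFS CLI; the container name {{master}}, the 120-second timeout, and the cleanup path are illustrative assumptions.

{code}
#!/usr/bin/env bash
# Sketch: block until the NameNode exits safe mode before submitting the job.
# 'hdfs dfsadmin -safemode wait' is a standard HDFS command that blocks until
# safe mode is off; 'master' is an assumed container name for this cluster.
if ! timeout 120 docker exec master hdfs dfsadmin -safemode wait; then
    echo "NameNode still in safe mode after 120s; freeing space and forcing it off"
    # Low disk space re-triggers safe mode, so free resources first (here:
    # stale .flink upload directories -- an assumed cleanup target), then
    # turn safe mode off manually, as the NameNode error message suggests.
    docker exec master bash -c "hdfs dfs -rm -r -skipTrash /user/hadoop-user/.flink || true"
    docker exec master hdfs dfsadmin -safemode leave
fi
{code}

Alternatively, the Docker images could simply provision more disk, or lower {{dfs.namenode.resource.du.reserved}} in hdfs-site.xml, so the NameNode never trips its low-resource safe-mode check in the first place.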