The exception is thrown even before any Flink code is executed, so I assume that your YARN setup is not working properly. Did you try running any other YARN application on that setup? I suspect that other systems such as MapReduce or Spark will not run in this environment either.
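As a quick sanity check (assuming the Hadoop binaries are on your PATH), commands along these lines should succeed on a healthy cluster before Flink is involved at all; the example jar path follows the usual Hadoop 2.x layout, so adjust it to your installation:

```shell
# List the NodeManagers registered with the ResourceManager.
# Each entry should show a real hostname, not localhost.
yarn node -list

# Run a trivial built-in MapReduce job end to end to verify that
# containers can actually be launched on the NodeManagers.
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar pi 2 10
```

If the pi job fails with a similar ConnectException, the problem is in the Hadoop/YARN configuration, not in Flink.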
Maybe the yarn-site.xml on the NodeManager hosts is not correct (pointing to localhost instead of the master).

On Thu, Nov 19, 2015 at 11:41 AM, Stefanos Antaris <antaris.stefa...@gmail.com> wrote:

> Hi to all,
>
> I am trying to use Flink with Hadoop YARN, but I am facing an exception while
> trying to create a yarn-session.
>
> First of all, I have a Hadoop cluster with 20 VMs that uses YARN. I can start
> the Hadoop cluster and run Hadoop jobs without any problem. Furthermore, I am
> trying to deploy a Flink cluster on the same VMs and use the Flink YARN
> client. I have the HADOOP_HOME environment variable set and the Hadoop
> cluster up and running. When I execute the
> ./bin/yarn-session.sh -n 10 -tm 8192 -s 32 command I get the following
> exception. Can someone explain to me how to solve this?
>
> 10:20:56,105 INFO  org.apache.hadoop.yarn.client.RMProxy - Connecting to ResourceManager at master/192.168.0.194:8032
> 10:20:56,353 WARN  org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 10:20:57,095 INFO  org.apache.flink.yarn.FlinkYarnClient - Using values:
> 10:20:57,097 INFO  org.apache.flink.yarn.FlinkYarnClient -   TaskManager count = 10
> 10:20:57,097 INFO  org.apache.flink.yarn.FlinkYarnClient -   JobManager memory = 1024
> 10:20:57,097 INFO  org.apache.flink.yarn.FlinkYarnClient -   TaskManager memory = 2048
> 10:20:57,365 WARN  org.apache.flink.yarn.FlinkYarnClient - This YARN session requires 21504MB of memory in the cluster. There are currently only 8192MB available. The Flink YARN client will try to allocate the YARN session, but maybe not all TaskManagers are connecting from the beginning because the resources are currently not available in the cluster. The allocation might take more time than usual because the Flink YARN client needs to wait until the resources become available.
> 10:20:57,365 WARN  org.apache.flink.yarn.FlinkYarnClient - There is not enough memory available in the YARN cluster. The TaskManager(s) require 2048MB each. NodeManagers available: [8192]. After allocating the JobManager (1024MB) and (3/10) TaskManagers, the following NodeManagers are available: [1024]. The Flink YARN client will try to allocate the YARN session, but maybe not all TaskManagers are connecting from the beginning because the resources are currently not available in the cluster. The allocation might take more time than usual because the Flink YARN client needs to wait until the resources become available.
> [the same WARN message is repeated for (4/10) through (9/10) TaskManagers]
> 10:20:58,204 INFO  org.apache.flink.yarn.Utils - Copying from file:/home/hduser/flink-0.10.0/lib/flink-dist-0.10.0.jar to hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/flink-dist-0.10.0.jar
> 10:21:00,235 INFO  org.apache.flink.yarn.Utils - Copying from /home/hduser/flink-0.10.0/conf/flink-conf.yaml to hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/flink-conf.yaml
> 10:21:00,277 INFO  org.apache.flink.yarn.Utils - Copying from file:/home/hduser/flink-0.10.0/lib/log4j-1.2.17.jar to hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/log4j-1.2.17.jar
> 10:21:00,349 INFO  org.apache.flink.yarn.Utils - Copying from file:/home/hduser/flink-0.10.0/lib/slf4j-log4j12-1.7.7.jar to hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/slf4j-log4j12-1.7.7.jar
> 10:21:00,400 INFO  org.apache.flink.yarn.Utils - Copying from file:/home/hduser/flink-0.10.0/lib/flink-python-0.10.0.jar to hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/flink-python-0.10.0.jar
> 10:21:00,441 INFO  org.apache.flink.yarn.Utils - Copying from /home/hduser/flink-0.10.0/conf/logback.xml to hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/logback.xml
> 10:21:00,486 INFO  org.apache.flink.yarn.Utils - Copying from /home/hduser/flink-0.10.0/conf/log4j.properties to hdfs://master:54310/user/hduser/.flink/application_1447928096470_0002/log4j.properties
> 10:21:00,553 INFO  org.apache.flink.yarn.FlinkYarnClient - Submitting application master application_1447928096470_0002
> 10:21:00,963 INFO  org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1447928096470_0002
> 10:21:00,964 INFO  org.apache.flink.yarn.FlinkYarnClient - Waiting for the cluster to be allocated
> 10:21:00,969 INFO  org.apache.flink.yarn.FlinkYarnClient - Deploying cluster, current state ACCEPTED
> [the same line is logged once per second until 10:21:11,011]
> Error while deploying YARN cluster: The YARN application unexpectedly switched to state FAILED during deployment.
> Diagnostics from YARN: Application application_1447928096470_0002 failed 1 times due to Error launching appattempt_1447928096470_0002_000001. Got exception: java.net.ConnectException: Call From flink-master/127.0.0.1 to localhost:38425 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:792)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:732)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1407)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy31.startContainers(Unknown Source)
>         at org.apache.hadoop.yarn.api.impl.pb.client.ContainerManagementProtocolPBClientImpl.startContainers(ContainerManagementProtocolPBClientImpl.java:96)
>         at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.launch(AMLauncher.java:119)
>         at org.apache.hadoop.yarn.server.resourcemanager.amlauncher.AMLauncher.run(AMLauncher.java:254)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: java.net.ConnectException: Connection refused
>         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
>         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:744)
>         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:531)
>         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:495)
>         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:609)
>         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:707)
>         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:370)
>         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1529)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1446)
>         ... 9 more
> Failing the application.
> If log aggregation is enabled on your cluster, use this command to further investigate the issue:
> yarn logs -applicationId application_1447928096470_0002
> org.apache.flink.yarn.FlinkYarnClientBase$YarnDeploymentException: The YARN application unexpectedly switched to state FAILED during deployment.
> [the exception repeats the same YARN diagnostics and stack trace as above, followed by:]
>         at org.apache.flink.yarn.FlinkYarnClientBase.deployInternal(FlinkYarnClientBase.java:646)
>         at org.apache.flink.yarn.FlinkYarnClientBase.deploy(FlinkYarnClientBase.java:338)
>         at org.apache.flink.client.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:409)
>         at org.apache.flink.client.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:351)
>
> Just to mention that my flink-conf.yaml is the following:
>
> #==============================================================================
> # Common
> #==============================================================================
> # The host on which the JobManager runs. Only used in non-high-availability mode.
> # The JobManager process will use this hostname to bind the listening servers to.
> # The TaskManagers will try to connect to the JobManager on that host.
> jobmanager.rpc.address: master
>
> # The port where the JobManager's main actor system listens for messages.
> jobmanager.rpc.port: 6123
>
> # The heap size for the JobManager JVM
> jobmanager.heap.mb: 256
>
> # The heap size for the TaskManager JVM
> taskmanager.heap.mb: 512
>
> # The number of task slots that each TaskManager offers. Each slot runs one
> # parallel pipeline.
> taskmanager.numberOfTaskSlots: 10
>
> # The parallelism used for programs that did not specify any other parallelism.
> parallelism.default: 5
>
> #==============================================================================
> # Web Frontend
> #==============================================================================
> # The port under which the web-based runtime monitor listens.
> # A value of -1 deactivates the web server.
> jobmanager.web.port: 8081
>
> # The port under which the standalone web client
> # (for job upload and submit) listens.
> webclient.port: 8080
>
> #==============================================================================
> # Streaming state checkpointing
> #==============================================================================
> # The backend that will be used to store operator state checkpoints if
> # checkpointing is enabled.
> #
> # Supported backends: jobmanager, filesystem, <class-name-of-factory>
> #
> # state.backend: filesystem
>
> # Directory for storing checkpoints in a Flink-supported filesystem.
> # Note: State backend must be accessible from the JobManager and all TaskManagers.
> # Use "hdfs://" for HDFS setups, "file://" for UNIX/POSIX-compliant file systems
> # (or any local file system under Windows), or "S3://" for S3 file system.
> #
> # state.backend.fs.checkpointdir: hdfs://namenode-host:port/flink-checkpoints
>
> #==============================================================================
> # Advanced
> #==============================================================================
> # The number of buffers for the network stack.
> #
> # taskmanager.network.numberOfBuffers: 2048
>
> # Directories for temporary files.
> #
> # Add a delimited list for multiple directories, using the system directory
> # delimiter (colon ':' on unix) or a comma, e.g.:
> # /data1/tmp:/data2/tmp:/data3/tmp
> #
> # Note: Each directory entry is read from and written to by a different I/O
> # thread. You can include the same directory multiple times in order to create
> # multiple I/O threads against that directory. This is for example relevant for
> # high-throughput RAIDs.
> #
> # If not specified, the system-specific Java temporary directory (java.io.tmpdir
> # property) is taken.
> #
> # taskmanager.tmp.dirs: /tmp
>
> # Path to the Hadoop configuration directory.
> #
> # This configuration is used when writing into HDFS. Unless specified otherwise,
> # HDFS file creation will use HDFS default settings with respect to block-size,
> # replication factor, etc.
> #
> # You can also directly specify the paths to hdfs-default.xml and hdfs-site.xml
> # via keys 'fs.hdfs.hdfsdefault' and 'fs.hdfs.hdfssite'.
> #
> #fs.hdfs.hadoopconf: /usr/local/hadoop/etc/hadoop/
>
> #==============================================================================
> # Master High Availability (required configuration)
> #==============================================================================
> # The list of ZooKeeper quorum peers that coordinate the high-availability
> # setup. This must be a list of the form:
> # "host1:clientPort,host2[:clientPort],..." (default clientPort: 2181)
> #
> # recovery.mode: zookeeper
> #
> # recovery.zookeeper.quorum: localhost:2181,...
> #
> # Note: You need to set the state backend to 'filesystem' and the checkpoint
> # directory (see above) before configuring the storageDir.
> #
> # recovery.zookeeper.storageDir: hdfs:///recovery
>
> Thanks in advance,
> Stefanos Antaris
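One more pointer: the "Call From flink-master/127.0.0.1 to localhost:38425" part of the exception suggests that at least one host resolves names to the loopback address, so the ResourceManager tries to launch the ApplicationMaster via localhost. A sketch of the yarn-site.xml entry that should be set on every node, assuming the ResourceManager really runs on a host named master (the property name is the standard YARN one; verify against your Hadoop version's yarn-default.xml):

```xml
<configuration>
  <!-- Make every NodeManager and client contact the real master host,
       not localhost. The individual *.address properties derive their
       host part from this value by default. -->
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>master</value>
  </property>
</configuration>
```

It is also worth checking /etc/hosts on all machines: if a host's own name is mapped to 127.0.0.1 there, NodeManagers will register with addresses that are unreachable from the other machines.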