Yan, I have followed the doc. Here is what was done:

1. Set up yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.ha.enabled</name>
    <value>true</value>
  </property>
  <property>
    <name>yarn.resourcemanager.cluster-id</name>
    <value>cluster1</value>
  </property>
  <property>
    <name>yarn.resourcemanager.ha.rm-ids</name>
    <value>rm1,rm2</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm1</name>
    <value>sprdargas402.</value>
  </property>
  <property>
    <name>yarn.resourcemanager.hostname.rm2</name>
    <value>sprdargas403.</value>
  </property>
  <property>
    <description>Enable RM to recover state after starting. If true, then yarn.resourcemanager.store.class must be specified</description>
    <name>yarn.resourcemanager.recovery.enabled</name>
    <value>true</value>
  </property>
  <property>
    <description>The class to use as the persistent store.</description>
    <name>yarn.resourcemanager.store.class</name>
    <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-state-store.address</name>
    <value>sprdargas402.:2181</value>
  </property>
  <property>
    <name>yarn.resourcemanager.zk-address</name>
    <value>sprdargas402.:2181,sprdargas403.:2181,sprdargas404.:2181</value>
  </property>
  <property>
    <description>CLASSPATH for YARN applications. A comma-separated list of CLASSPATH entries</description>
    <name>yarn.application.classpath</name>
    <value>/app/hadoop/hadoop-2.5.2/conf,/app/hadoop/hadoop-2.5.2/share/hadoop/common/*,/app/hadoop/hadoop-2.5.2/share/hadoop/common/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/hdfs/*,/app/hadoop/hadoop-2.5.2/share/hadoop/hdfs/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/mapreduce/*,/app/hadoop/hadoop-2.5.2/share/hadoop/mapreduce/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/yarn/*,/app/hadoop/hadoop-2.5.2/share/hadoop/yarn/lib/*</value>
  </property>
</configuration>

2. scp'd the config to the slave resource manager node.
3. Restarted YARN on node 1.

I am not sure if I missed anything.
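For reference, steps 2 and 3 were roughly the commands below. The conf and sbin paths are inferred from the install dir in yarn.application.classpath above, the hostname suffix is trimmed the same way as in the config, and the exact restart method may have differed slightly; the last two lines are just the sanity check for which RM is active:

# step 2: copy the HA config to the standby RM node (rm2)
scp /app/hadoop/hadoop-2.5.2/conf/yarn-site.xml sprdargas403.:/app/hadoop/hadoop-2.5.2/conf/

# step 3: restart YARN on node 1 (here just the RM daemon; stop-yarn.sh / start-yarn.sh would bounce everything)
/app/hadoop/hadoop-2.5.2/sbin/yarn-daemon.sh stop resourcemanager
/app/hadoop/hadoop-2.5.2/sbin/yarn-daemon.sh start resourcemanager

# confirm which RM ended up active vs standby
yarn rmadmin -getServiceState rm1
yarn rmadmin -getServiceState rm2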
- Shekar

On Thu, May 14, 2015 at 3:06 PM, Yan Fang <yanfang...@gmail.com> wrote:

> Is the HA set correctly? The log looks like it's in the YARN setting side.
>
> Fang, Yan
> yanfang...@gmail.com
>
> On Thu, May 14, 2015 at 12:29 PM, Shekar Tippur <ctip...@gmail.com> wrote:
>
> > Other observation I forgot to mention is that if I kill the rm and nm
> > process, samza job seem to run properly. Only when 01 server is rebooted,
> > I seem to encounter this error and as a result, no jobs get processed.
> >
> > - Shekar
> >
> > On Thu, May 14, 2015 at 12:14 PM, Shekar Tippur <ctip...@gmail.com> wrote:
> >
> > > Hello,
> > >
> > > I have setup redundancy on resource manager based on this doc
> > > https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
> > > I then shut down server 1 and was expecting that 02 server would take over.
> > >
> > > Instead I see this error. I am not sure if I am missing something.
> > >
> > > 2015-05-14 11:55:01,820 INFO [Node Status Updater] retry.RetryInvocationHandler (RetryInvocationHandler.java:invoke(140)) - Exception while invoking nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 19 fail over attempts. Trying to fail over after sleeping for 24180ms.
> > >
> > > java.net.ConnectException: Call From sprdargas403t/10.180.195.33 to sprdargas403:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > >         at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > >         at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > >         at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > >         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> > >         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
> > >         at org.apache.hadoop.ipc.Client.call(Client.java:1415)
> > >         at org.apache.hadoop.ipc.Client.call(Client.java:1364)
> > >         at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > >         at com.sun.proxy.$Proxy27.nodeHeartbeat(Unknown Source)
> > >         at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> > >         at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> > >         at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > >         at java.lang.reflect.Method.invoke(Method.java:606)
> > >         at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > >         at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > >         at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
> > >         at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:512)
> > >         at java.lang.Thread.run(Thread.java:745)
> > > Caused by: java.net.ConnectException: Connection refused
> > >         at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> > >         at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> > >         at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> > >         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
> > >         at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> > >         at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
> > >         at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
> > >         at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> > >         at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
> > >         at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> > >         ... 12 more
> > >
> > > 2015-05-14 11:55:01,965 INFO [Container Monitor] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 21428 for container-id container_1431628855028_0001_01_000001: 369.7 MB of 1 GB physical memory used; 1.4 GB of 2.1 GB virtual memory used