I think I figured it out.
stop-yarn.sh and start-yarn.sh restart the resourcemanager on the same node
but not on the other nodes.
The workaround is to go to each RM node and start the resourcemanager
individually.
I see that these scripts invoke another command, yarn-daemon.sh, which takes
a --hosts argument. I tried adding that, but it didn't do the trick. At least
I have a workaround, though it may not be optimal.
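
For the record, what I ran on each standby RM host was roughly this (path
per our install; my reading of the 2.5.2 scripts is that yarn-daemon.sh only
acts on the local host, while --hosts is only honored by yarn-daemons.sh,
which would explain why passing it had no effect):

    $ /app/hadoop/hadoop-2.5.2/sbin/yarn-daemon.sh start resourcemanager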

- Shekar

On Fri, May 15, 2015 at 2:00 PM, Gustavo Anatoly <gustavoanat...@gmail.com>
wrote:

> Hi, Shekar.
>
> The failure happens when connecting:
>     sprdargas403t/10.180.195.33 to sprdargas403:8031
>
> I suggest that you verify:
> 1) Run $ nmap -sT -p 8031 sprdargas403 to check whether port 8031 is open;
> 2) Use traceroute to check whether the machine name is being resolved
> correctly;
> 3) Check your /etc/hosts to confirm the names are mapped correctly;
>      something like this:
>
>      127.0.0.1         localhost
>      10.10.2.120       D07831
>
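> You could also test the port directly from the failing node, assuming
> netcat is installed:
>
>     $ nc -zv sprdargas403 8031    # -z: scan only, -v: report the result
>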
> I hope this helps.
>
> Thanks,
>
>
>
>
> 2015-05-15 15:26 GMT-03:00 Yan Fang <yanfang...@gmail.com>:
>
> > Hi Shekar,
> >
> > I do not have much experience setting up HA, so if I were you, I would
> > check: 1) when you take the RM down, does the backup RM run successfully?
> > 2) if the backup RM runs successfully, can you see the Samza application
> > running in the YARN UI (e.g., localhost:8088)? 3) if you cannot see it,
> > what do Samza's logs say?
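> >
> > For 1), assuming the rm-ids are rm1 and rm2, something like this should
> > show which RM is active:
> >
> >     $ yarn rmadmin -getServiceState rm1
> >     $ yarn rmadmin -getServiceState rm2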
> >
> > Thanks,
> >
> > Fang, Yan
> > yanfang...@gmail.com
> >
> > On Thu, May 14, 2015 at 3:31 PM, Shekar Tippur <ctip...@gmail.com> wrote:
> >
> > > Yan,
> > > I have followed the doc. Here is what I did:
> > > 1. Set up yarn-site.xml:
> > >
> > > <configuration>
> > >   <property>
> > >     <name>yarn.resourcemanager.ha.enabled</name>
> > >     <value>true</value>
> > >   </property>
> > >   <property>
> > >     <name>yarn.resourcemanager.cluster-id</name>
> > >     <value>cluster1</value>
> > >   </property>
> > >   <property>
> > >     <name>yarn.resourcemanager.ha.rm-ids</name>
> > >     <value>rm1,rm2</value>
> > >   </property>
> > >   <property>
> > >     <name>yarn.resourcemanager.hostname.rm1</name>
> > >     <value>sprdargas402.</value>
> > >   </property>
> > >   <property>
> > >     <name>yarn.resourcemanager.hostname.rm2</name>
> > >     <value>sprdargas403.</value>
> > >   </property>
> > >   <property>
> > >     <description>Enable RM to recover state after starting. If true, then
> > >     yarn.resourcemanager.store.class must be specified</description>
> > >     <name>yarn.resourcemanager.recovery.enabled</name>
> > >     <value>true</value>
> > >   </property>
> > >   <property>
> > >     <description>The class to use as the persistent store.</description>
> > >     <name>yarn.resourcemanager.store.class</name>
> > >     <value>org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore</value>
> > >   </property>
> > >   <property>
> > >     <name>yarn.resourcemanager.zk-state-store.address</name>
> > >     <value>sprdargas402.:2181</value>
> > >   </property>
> > >   <property>
> > >     <name>yarn.resourcemanager.zk-address</name>
> > >     <value>sprdargas402.:2181,sprdargas403.:2181,sprdargas404.:2181</value>
> > >   </property>
> > >   <property>
> > >     <description>CLASSPATH for YARN applications. A comma-separated list
> > >     of CLASSPATH entries</description>
> > >     <name>yarn.application.classpath</name>
> > >     <value>/app/hadoop/hadoop-2.5.2/conf,/app/hadoop/hadoop-2.5.2/share/hadoop/common/*,/app/hadoop/hadoop-2.5.2/share/hadoop/common/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/hdfs/*,/app/hadoop/hadoop-2.5.2/share/hadoop/hdfs/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/mapreduce/*,/app/hadoop/hadoop-2.5.2/share/hadoop/mapreduce/lib/*,/app/hadoop/hadoop-2.5.2/share/hadoop/yarn/*,/app/hadoop/hadoop-2.5.2/share/hadoop/yarn/lib/*</value>
> > >   </property>
> > > </configuration>
> > >
> > >
> > > 2. scp'd the config to the slave resource manager node
> > >
> > > 3. Restarted YARN on node 1.
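> > >
> > > (By "restart" I mean the bundled scripts, roughly:
> > >
> > >     $ /app/hadoop/hadoop-2.5.2/sbin/stop-yarn.sh
> > >     $ /app/hadoop/hadoop-2.5.2/sbin/start-yarn.sh )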
> > >
> > > I am not sure if I missed anything.
> > >
> > > - Shekar
> > >
> > > On Thu, May 14, 2015 at 3:06 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > >
> > > > Is the HA set up correctly? The log suggests the problem is on the
> > > > YARN setup side.
> > > >
> > > > Fang, Yan
> > > > yanfang...@gmail.com
> > > >
> > > > On Thu, May 14, 2015 at 12:29 PM, Shekar Tippur <ctip...@gmail.com> wrote:
> > > >
> > > > > Another observation I forgot to mention: if I kill the rm and nm
> > > > > processes, the samza job seems to run properly. Only when the 01
> > > > > server is rebooted do I encounter this error, and as a result, no
> > > > > jobs get processed.
> > > > >
> > > > > - Shekar
> > > > >
> > > > > On Thu, May 14, 2015 at 12:14 PM, Shekar Tippur <ctip...@gmail.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I have set up redundancy on the resource manager based on this doc:
> > > > > > https://hadoop.apache.org/docs/r2.4.1/hadoop-yarn/hadoop-yarn-site/ResourceManagerHA.html
> > > > > > I then shut down server 1 and was expecting the 02 server to take
> > > > > > over.
> > > > > >
> > > > > > Instead I see this error. I am not sure if I am missing something.
> > > > > >
> > > > > > 2015-05-14 11:55:01,820 INFO  [Node Status Updater] retry.RetryInvocationHandler (RetryInvocationHandler.java:invoke(140)) - Exception while invoking nodeHeartbeat of class ResourceTrackerPBClientImpl over rm2 after 19 fail over attempts. Trying to fail over after sleeping for 24180ms.
> > > > > >
> > > > > > java.net.ConnectException: Call From sprdargas403t/10.180.195.33 to sprdargas403:8031 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused
> > > > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> > > > > >     at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
> > > > > >     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> > > > > >     at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
> > > > > >     at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:783)
> > > > > >     at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:730)
> > > > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1415)
> > > > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1364)
> > > > > >     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:206)
> > > > > >     at com.sun.proxy.$Proxy27.nodeHeartbeat(Unknown Source)
> > > > > >     at org.apache.hadoop.yarn.server.api.impl.pb.client.ResourceTrackerPBClientImpl.nodeHeartbeat(ResourceTrackerPBClientImpl.java:80)
> > > > > >     at sun.reflect.GeneratedMethodAccessor7.invoke(Unknown Source)
> > > > > >     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > > > > >     at java.lang.reflect.Method.invoke(Method.java:606)
> > > > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
> > > > > >     at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
> > > > > >     at com.sun.proxy.$Proxy28.nodeHeartbeat(Unknown Source)
> > > > > >     at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl$1.run(NodeStatusUpdaterImpl.java:512)
> > > > > >     at java.lang.Thread.run(Thread.java:745)
> > > > > > Caused by: java.net.ConnectException: Connection refused
> > > > > >     at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
> > > > > >     at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
> > > > > >     at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206)
> > > > > >     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:529)
> > > > > >     at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:493)
> > > > > >     at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:606)
> > > > > >     at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:700)
> > > > > >     at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:367)
> > > > > >     at org.apache.hadoop.ipc.Client.getConnection(Client.java:1463)
> > > > > >     at org.apache.hadoop.ipc.Client.call(Client.java:1382)
> > > > > >     ... 12 more
> > > > > >
> > > > > > 2015-05-14 11:55:01,965 INFO  [Container Monitor] monitor.ContainersMonitorImpl (ContainersMonitorImpl.java:run(408)) - Memory usage of ProcessTree 21428 for container-id container_1431628855028_0001_01_000001: 369.7 MB of 1 GB physical memory used; 1.4 GB of 2.1 GB virtual memory used
> > > > > >
> > > > >
> > > >
> > >
> >
>
