Hi Guys, Just a update on this issues. I turned on our debugger when running our jobs. Was able to nail it down to "Connection refused" error. I am not sure what has caused this error and how it can be prevent. Looking at the datanodes when the job was running, I did not notice anything wrong. If anyone has seen this issue and can shred some light on it, that would be great. Error logs are given below. Interesting thing is it does not happen every time. It a sporadic occurrence and if we restart our hive job again, it goes through without any errors.
*Error Logs:* 16/05/10 16:10:14 DEBUG client.TezClientUtils: Connecting to Tez AM at maskserver21/1.1.1.1:56372 16/05/10 16:10:15 DEBUG security.UserGroupInformation: PrivilegedAction as:aiprod (auth:SIMPLE) from:org.apache.tez.client.TezClientUtils.getAMProxy(TezClientUtils.java:826) 16/05/10 16:10:15 DEBUG ipc.Client: getting client out of cache: org.apache.hadoop.ipc.Client@1c8825a5 16/05/10 16:10:40 DEBUG ipc.Client: IPC Client (1495065854) connection to maskserver1/2.2.2.2:8020 from aiprod: closed 16/05/10 16:10:40 DEBUG ipc.Client: IPC Client (1495065854) connection to maskserver1/2.2.2.2:8020 from aiprod: stopped, remaining connections 1 16/05/10 16:10:44 DEBUG ipc.Client: IPC Client (1495065854) connection to maskserver11/2.2.2.21:8050 from aiprod: closed 16/05/10 16:10:44 DEBUG ipc.Client: IPC Client (1495065854) connection to maskserver11/2.2.2.21:8050 from aiprod: stopped, remaining connections 0 16/05/10 16:21:10 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 executed 16/05/10 16:21:10 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 expired 16/05/10 16:21:10 DEBUG hdfs.LeaseRenewer: Lease renewer daemon for [] with renew id 1 exited 16/05/10 16:21:11 DEBUG ipc.Client: The ping interval is 60000 ms. 16/05/10 16:21:11 DEBUG ipc.Client: Connecting to maskserver21/1.1.1.1:56372 16/05/10 16:21:12 INFO ipc.Client: Retrying connect to server: maskserver21/ 1.1.1.1:56372. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 16/05/10 16:21:13 INFO ipc.Client: Retrying connect to server: maskserver21/ 1.1.1.1:56372. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) . . . 16/05/10 16:22:01 INFO ipc.Client: Retrying connect to server: maskserver21/ 1.1.1.1:56372. Already tried 49 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS) 16/05/10 16:22:01 DEBUG ipc.Client: closing ipc connection to maskserver21/ 1.1.1.1:56372: Connection refused java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522) at org.apache.hadoop.ipc.Client.call(Client.java:1439) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) at $Proxy31.getAMStatus(Unknown Source) at org.apache.tez.client.TezClient.getAppMasterStatus(TezClient.java:535) at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:610) at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212) at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:122) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:454) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:672) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.run(RunJar.java:221) at org.apache.hadoop.util.RunJar.main(RunJar.java:136) 16/05/10 16:22:01 DEBUG ipc.Client: IPC Client (1495065854) connection to maskserver21/1.1.1.1:56372 from aiprod: closed 16/05/10 16:22:01 INFO client.TezClient: Failed to retrieve AM Status via proxy com.google.protobuf.ServiceException: java.net.ConnectException: Call From maskserver2/3.3.3.3 to maskserver21:56372 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:246) at $Proxy31.getAMStatus(Unknown Source) at org.apache.tez.client.TezClient.getAppMasterStatus(TezClient.java:535) at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:610) at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212) at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:122) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:454) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:672) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.run(RunJar.java:221) at org.apache.hadoop.util.RunJar.main(RunJar.java:136) Caused by: java.net.ConnectException: Call From maskserver2/3.3.3.3 to maskserver21:56372 failed on connection exception: java.net.ConnectException: Connection refused; For more details see: http://wiki.apache.org/hadoop/ConnectionRefused at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27) at java.lang.reflect.Constructor.newInstance(Constructor.java:513) at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:791) at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:731) at org.apache.hadoop.ipc.Client.call(Client.java:1473) at org.apache.hadoop.ipc.Client.call(Client.java:1400) at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:232) ... 14 more Caused by: java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:567) at org.apache.hadoop.net.SocketIOWithTimeout.connect(SocketIOWithTimeout.java:206) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:530) at org.apache.hadoop.net.NetUtils.connect(NetUtils.java:494) at org.apache.hadoop.ipc.Client$Connection.setupConnection(Client.java:608) at org.apache.hadoop.ipc.Client$Connection.setupIOstreams(Client.java:706) at org.apache.hadoop.ipc.Client$Connection.access$2800(Client.java:369) at org.apache.hadoop.ipc.Client.getConnection(Client.java:1522) at org.apache.hadoop.ipc.Client.call(Client.java:1439) ... 16 more 16/05/10 16:22:02 DEBUG ipc.Client: The ping interval is 60000 ms. 16/05/10 16:22:02 DEBUG ipc.Client: Connecting to maskserver11/2.2.2.21:8050 16/05/10 16:22:02 DEBUG ipc.Client: IPC Client (1495065854) connection to maskserver11/2.2.2.21:8050 from aiprod: starting, having connections 1 16/05/10 16:22:02 DEBUG ipc.Client: IPC Client (1495065854) connection to maskserver11/2.2.2.21:8050 from aiprod sending #70 16/05/10 16:22:02 DEBUG ipc.Client: IPC Client (1495065854) connection to maskserver11/2.2.2.21:8050 from aiprod got value #70 16/05/10 16:22:02 DEBUG ipc.ProtobufRpcEngine: Call: getApplicationReport took 3ms Exception in thread "main" java.lang.RuntimeException: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. No cluster diagnostics found. at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:457) at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:672) at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.util.RunJar.run(RunJar.java:221) at org.apache.hadoop.util.RunJar.main(RunJar.java:136) Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. No cluster diagnostics found. at org.apache.tez.client.TezClient.waitTillReady(TezClient.java:612) at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212) at org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:122) at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:454) ... 8 more Regards Khaja Hussain On Sat, Apr 9, 2016 at 5:13 PM, Khaja Hussain <khaja....@gmail.com> wrote: > Hi Guys - > > I am getting this error messages "java.lang.RuntimeException: > org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. > No cluster diagnostics found" for some of the jobs I am running. The > cluster are heavily loaded. I am suspecting this to be some kind of > time-out issue. Any pointer or tez setting that needs to be looked into > would be helpful. From the log I can infer even the application has not > started. > > > *Logs:* > > *16/04/09 14:15:11 WARN conf.HiveConf: HiveConf of name > hive.semantic.analyzer.factory.impl does not exist* > > *Logging initialized using configuration in > file:/etc/hive/conf/hive-log4j.properties* > *SLF4J: Class path contains multiple SLF4J bindings.* > *SLF4J: Found binding in > [jar:file:/usr/hdp/2.2.4.2-2/hadoop/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]* > *SLF4J: Found binding in > [jar:file:/usr/hdp/2.2.4.2-2/hive/lib/hive-jdbc-0.14.0.2.2.4.2-2-standalone.jar!/org/slf4j/impl/StaticLoggerBinder.class]* > *SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings > <http://www.slf4j.org/codes.html#multiple_bindings> for an explanation.* > *SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]* > *Exception in thread "main" java.lang.RuntimeException: > org.apache.tez.dag.api.SessionNotRunning: TezSession has already shutdown. > No cluster diagnostics found.* > * at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:457)* > * at org.apache.hadoop.hive.cli.CliDriver.run(CliDriver.java:672)* > * at org.apache.hadoop.hive.cli.CliDriver.main(CliDriver.java:616)* > * at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)* > * at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)* > * at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)* > * at java.lang.reflect.Method.invoke(Method.java:597)* > * at org.apache.hadoop.util.RunJar.run(RunJar.java:221)* > * at org.apache.hadoop.util.RunJar.main(RunJar.java:136)* > *Caused by: org.apache.tez.dag.api.SessionNotRunning: TezSession has > already shutdown. No cluster diagnostics found.* > * at > org.apache.tez.client.TezClient.waitTillReady(TezClient.java:612)* > * at > org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:212)* > * at > org.apache.hadoop.hive.ql.exec.tez.TezSessionState.open(TezSessionState.java:122)* > * at > org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:454)* > * ... 8 more* > > > Regards > Khaja Hussain >