[ https://issues.apache.org/jira/browse/HIVE-15722?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Siddharth Seth updated HIVE-15722:
----------------------------------
    Resolution: Fixed
 Fix Version/s: 2.2.0
        Status: Resolved  (was: Patch Available)

> LLAP: Avoid marking a query as complete if the AMReporter runs into an error
> ----------------------------------------------------------------------------
>
>                 Key: HIVE-15722
>                 URL: https://issues.apache.org/jira/browse/HIVE-15722
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Siddharth Seth
>            Assignee: Siddharth Seth
>             Fix For: 2.2.0
>
>         Attachments: HIVE-15722.01.patch, HIVE-15722.02.patch,
> HIVE-15722.03.patch, HIVE-15722.04.patch
>
>
> When the AMReporter runs into an error (typically intermittent), we end up
> killing all fragments on the daemon. This is done by marking the query as
> complete.
> The AM would continue to try scheduling on this node, which would lead to
> task failures if the daemon structures are updated.
> Instead of clearing the structures, it's better to kill the fragments and
> let a queryComplete call come in from the AM.
> Later, we could make enhancements in the AM to avoid such nodes. That's not
> simple though, since the AM will not find out what happened due to the
> communication failure from the daemon.
> This leads to:
> {code}
> org.apache.hadoop.ipc.RemoteException(java.lang.RuntimeException): Dag query16 already complete. Rejecting fragment [Map 7, 29, 0]
>         at org.apache.hadoop.hive.llap.daemon.impl.QueryTracker.registerFragment(QueryTracker.java:149)
>         at org.apache.hadoop.hive.llap.daemon.impl.ContainerRunnerImpl.submitWork(ContainerRunnerImpl.java:226)
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapDaemon.submitWork(LlapDaemon.java:487)
>         at org.apache.hadoop.hive.llap.daemon.impl.LlapProtocolServerImpl.submitWork(LlapProtocolServerImpl.java:101)
>         at org.apache.hadoop.hive.llap.daemon.rpc.LlapDaemonProtocolProtos$LlapDaemonProtocol$2.callBlockingMethod(LlapDaemonProtocolProtos.java:16728)
>         at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:640)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:982)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2313)
>         at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2309)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1724)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2307)
> {code}

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
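As a rough illustration of the behaviour change described in HIVE-15722, the Java sketch below models a daemon-side tracker in toy form. It is not Hive's actual `QueryTracker`. The class `QueryTrackerSketch` and the methods `onAmReporterErrorOld`, `onAmReporterErrorNew`, and `runningCount` are hypothetical, and "killing" a fragment is reduced to removing it from a set. The sketch shows why marking the query complete on an AMReporter error makes later `registerFragment` calls from the AM fail with the "already complete" error. Killing only the fragments instead leaves the query open until the AM's `queryComplete` call arrives.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical, heavily simplified model of a daemon-side query tracker.
// Not Hive's QueryTracker API; all names here are illustrative assumptions.
class QueryTrackerSketch {
  // DAGs the daemon considers finished; new fragments for these are rejected.
  private final Set<String> completedDags = new HashSet<>();
  // Currently running fragments, keyed by DAG name.
  private final Map<String, Set<String>> runningFragments = new HashMap<>();

  void registerFragment(String dagName, String fragmentId) {
    if (completedDags.contains(dagName)) {
      // Same shape as the error in the stack trace above.
      throw new RuntimeException(
          "Dag " + dagName + " already complete. Rejecting fragment [" + fragmentId + "]");
    }
    runningFragments.computeIfAbsent(dagName, k -> new HashSet<>()).add(fragmentId);
  }

  // Old behaviour (assumed): an AMReporter error marks the query complete,
  // so any fragment the AM schedules here afterwards is rejected.
  void onAmReporterErrorOld(String dagName) {
    runningFragments.remove(dagName);
    completedDags.add(dagName);
  }

  // New behaviour (assumed): kill the running fragments but keep the query
  // registered; completion is left to the AM's queryComplete call.
  void onAmReporterErrorNew(String dagName) {
    Set<String> frags = runningFragments.get(dagName);
    if (frags != null) {
      frags.clear(); // stand-in for killing each running fragment
    }
  }

  // Called when the AM reports that the query is done.
  void queryComplete(String dagName) {
    runningFragments.remove(dagName);
    completedDags.add(dagName);
  }

  int runningCount(String dagName) {
    Set<String> frags = runningFragments.get(dagName);
    return frags == null ? 0 : frags.size();
  }

  public static void main(String[] args) {
    QueryTrackerSketch oldTracker = new QueryTrackerSketch();
    oldTracker.registerFragment("query16", "Map 7, 28, 0");
    oldTracker.onAmReporterErrorOld("query16");
    try {
      oldTracker.registerFragment("query16", "Map 7, 29, 0");
    } catch (RuntimeException e) {
      // prints "old: Dag query16 already complete. Rejecting fragment [Map 7, 29, 0]"
      System.out.println("old: " + e.getMessage());
    }

    QueryTrackerSketch newTracker = new QueryTrackerSketch();
    newTracker.registerFragment("query16", "Map 7, 28, 0");
    newTracker.onAmReporterErrorNew("query16");
    newTracker.registerFragment("query16", "Map 7, 29, 0"); // accepted
    System.out.println("new: running=" + newTracker.runningCount("query16")); // prints "new: running=1"
  }
}
```

The design point the sketch tries to capture is that the decision to mark a query complete stays with the AM, the component that knows whether more fragments will be scheduled on the node.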