[ https://issues.apache.org/jira/browse/HIVE-19935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sergey Shelukhin updated HIVE-19935:
------------------------------------
    Resolution: Fixed
 Fix Version/s: 4.0.0
        Status: Resolved  (was: Patch Available)

Committed to master. Thanks for the review!

> Hive WM session killed: Failed to update LLAP tasks count
> ---------------------------------------------------------
>
>                 Key: HIVE-19935
>                 URL: https://issues.apache.org/jira/browse/HIVE-19935
>             Project: Hive
>          Issue Type: Bug
>          Components: HiveServer2
>    Affects Versions: 3.1.0
>            Reporter: Thai Bui
>            Assignee: Sergey Shelukhin
>            Priority: Minor
>             Fix For: 4.0.0
>
>         Attachments: HIVE-19935.01.patch, HIVE-19935.patch
>
>
> I'm getting this error with the WM feature quite frequently. It causes AM containers to shut down and new ones to be created to replace them.
> {noformat}
> 2018-06-18T19:06:49,969 INFO [Thread-250] monitoring.RenderStrategy$LogToFileFunction: Map 1: 313(+270)/641
> 2018-06-18T19:06:49,988 INFO [NotificationEventPoll 0] metastore.HiveMetaStore: 4: get_config_value: name=metastore.batch.retrieve.max defaultValue=50
> 2018-06-18T19:06:49,988 INFO [NotificationEventPoll 0] HiveMetaStore.audit: ugi=hive ip=unknown-ip-addr cmd=get_config_value: name=metastore.batch.retrieve.max defaultValue=50
> 2018-06-18T19:06:50,204 INFO [pool-29-thread-1] tez.TriggerValidatorRunnable: Query: hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4. Trigger { name: alluxio_medium, expression: ALLUXIO_BYTES_READ > 6442450944, action: MOVE TO medium } violated. Current value: 7184667126. Applying action.
> 2018-06-18T19:06:50,205 INFO [pool-29-thread-1] tez.WorkloadManager: Queued move session: 49be39e5-875c-4cfe-8601-7fe84dd57e0c moving from default to medium
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.WorkloadManager: Processing current events
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.WorkloadManager: Handling move session event: 49be39e5-875c-4cfe-8601-7fe84dd57e0c moving from default to medium
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.WorkloadManager: Subscribed to counters: [S3A_BYTES_READ, BYTES_READ, ALLUXIO_BYTES_READ]
> 2018-06-18T19:06:50,205 INFO [pool-29-thread-1] tez.KillMoveTriggerActionHandler: Moved session 49be39e5-875c-4cfe-8601-7fe84dd57e0c to pool medium
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.GuaranteedTasksAllocator: Updating 49be39e5-875c-4cfe-8601-7fe84dd57e0c with 144 guaranteed tasks
> 2018-06-18T19:06:50,205 INFO [Workload management master] tez.WmEvent: Added WMEvent: EventType: MOVE EventStartTimestamp: 1529348810205 elapsedTime: 0 wmTezSessionInfo:SessionId: 49be39e5-875c-4cfe-8601-7fe84dd57e0c Pool: medium Cluster %: 30.0
> 2018-06-18T19:06:50,234 INFO [StateChangeNotificationHandler] impl.ZkRegistryBase$InstanceStateChangeListener: CHILD_UPDATED for zknode /user-hive/llap/workers/worker-0000001571
> 2018-06-18T19:06:50,235 INFO [StateChangeNotificationHandler] tez.TezSessionPool: AM for 49be39e5-875c-4cfe-8601-7fe84dd57e0c, v.1571 has updated; updating [sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c, queueName=llap, user=hive, doAs=false, isOpen=true, isDefault=true, expires in 586277120ms, WM state poolName=medium, clusterFraction=0.3, queryId=hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4, killReason=null] with an endpoint at 32769
> 2018-06-18T19:06:50,235 INFO [StateChangeNotificationHandler] tez.TezSessionState: Ignoring an outdated info update 1571: TezAmInstance [49be39e5-875c-4cfe-8601-7fe84dd57e0c, host=ip-10-8-121-231.data.bazaarvoice.com, rpcPort=33365, pluginPort=32769, token=null]
> 2018-06-18T19:06:50,323 ERROR [TaskCommunicator # 4] tez.GuaranteedTasksAllocator: Failed to update guaranteed tasks count for the session sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c, queueName=llap, user=hive, doAs=false, isOpen=true, isDefault=true, expires in 586277032ms, WM state poolName=medium, clusterFraction=0.3, queryId=hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4, killReason=null
> com.google.protobuf.ServiceException: org.apache.hadoop.ipc.RemoteException(java.lang.NullPointerException): java.lang.NullPointerException
>     at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.checkAndSendGuaranteedStateUpdate(LlapTaskSchedulerService.java:596)
>     at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.updateGuaranteedCount(LlapTaskSchedulerService.java:581)
>     at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.updateQuery(LlapTaskSchedulerService.java:3041)
>     at org.apache.hadoop.hive.llap.tezplugins.endpoint.LlapPluginServerImpl.updateQuery(LlapPluginServerImpl.java:57)
>     at org.apache.hadoop.hive.llap.plugin.rpc.LlapPluginProtocolProtos$LlapPluginProtocol$2.callBlockingMethod(LlapPluginProtocolProtos.java:835)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:242) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
>     at com.sun.proxy.$Proxy111.updateQuery(Unknown Source) ~[?:?]
>     at org.apache.hadoop.hive.llap.impl.LlapPluginProtocolClientImpl.updateQuery(LlapPluginProtocolClientImpl.java:42) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.tez.LlapPluginEndpointClientImpl$SendUpdateQueryCallable.call(LlapPluginEndpointClientImpl.java:128) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>     at org.apache.hadoop.hive.ql.exec.tez.LlapPluginEndpointClientImpl$SendUpdateQueryCallable.call(LlapPluginEndpointClientImpl.java:105) ~[hive-exec-3.1.0-SNAPSHOT.jar:3.1.0-SNAPSHOT]
>     at com.google.common.util.concurrent.TrustedListenableFutureTask$TrustedFutureInterruptibleTask.runInterruptibly(TrustedListenableFutureTask.java:108) [guava-19.0.jar:?]
>     at com.google.common.util.concurrent.InterruptibleTask.run(InterruptibleTask.java:41) [guava-19.0.jar:?]
>     at com.google.common.util.concurrent.TrustedListenableFutureTask.run(TrustedListenableFutureTask.java:77) [guava-19.0.jar:?]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_171]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_171]
>     at java.lang.Thread.run(Thread.java:748) [?:1.8.0_171]
> Caused by: org.apache.hadoop.ipc.RemoteException: java.lang.NullPointerException
>     at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.checkAndSendGuaranteedStateUpdate(LlapTaskSchedulerService.java:596)
>     at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.updateGuaranteedCount(LlapTaskSchedulerService.java:581)
>     at org.apache.hadoop.hive.llap.tezplugins.LlapTaskSchedulerService.updateQuery(LlapTaskSchedulerService.java:3041)
>     at org.apache.hadoop.hive.llap.tezplugins.endpoint.LlapPluginServerImpl.updateQuery(LlapPluginServerImpl.java:57)
>     at org.apache.hadoop.hive.llap.plugin.rpc.LlapPluginProtocolProtos$LlapPluginProtocol$2.callBlockingMethod(LlapPluginProtocolProtos.java:835)
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:523)
>     at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:991)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:872)
>     at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:818)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1682)
>     at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2678)
>     at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1491) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
>     at org.apache.hadoop.ipc.Client.call(Client.java:1437) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
>     at org.apache.hadoop.ipc.Client.call(Client.java:1347) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
>     at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228) ~[hadoop-common-3.1.1-SNAPSHOT.2.6.1.0-129.jar:?]
>     ... 11 more
> 2018-06-18T19:06:50,323 INFO [Workload management master] tez.WorkloadManager: Processing current events
> 2018-06-18T19:06:50,323 INFO [Workload management master] tez.WorkloadManager: Update failed for sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c, queueName=llap, user=hive, doAs=false, isOpen=true, isDefault=true, expires in 586277032ms, WM state poolName=medium, clusterFraction=0.3, queryId=hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4, killReason=null
> 2018-06-18T19:06:50,324 INFO [Workload management master] tez.WorkloadManager: Replacing sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c, queueName=llap, user=hive, doAs=false, isOpen=true, isDefault=true, expires in 586277031ms, WM state poolName=null, clusterFraction=null, queryId=hive_20180618190637_e65869b8-10be-4880-a8d3-84989bd055b4, killReason=Failed to update resource allocation with a new session
> 2018-06-18T19:06:50,325 INFO [Workload management worker 0] tez.WorkloadManager: Created new interactive session object b89aaebf-aeaa-4b76-974a-7047592a186b
> 2018-06-18T19:06:50,325 INFO [Workload management worker 0] tez.TezSessionState: Closing Tez Session
> 2018-06-18T19:06:50,325 INFO [Workload management worker 0] client.TezClient: Shutting down Tez Session, sessionName=HIVE-49be39e5-875c-4cfe-8601-7fe84dd57e0c, applicationId=application_1528322657674_0427
> 2018-06-18T19:06:50,353 INFO [Workload management worker 0] tez.TezSessionState: Attemting to clean up resources for 49be39e5-875c-4cfe-8601-7fe84dd57e0c: hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/49be39e5-875c-4cfe-8601-7fe84dd57e0c-resources; 0 additional files, 2 localized resources
> 2018-06-18T19:06:50,354 INFO [Workload management worker 0] tez.TezSessionState: User of session id b89aaebf-aeaa-4b76-974a-7047592a186b is hive
> 2018-06-18T19:06:50,356 INFO [Workload management worker 0] tez.DagUtils: Localizing resource because it does not exist: file:/usr/hdp/current/hive-server2-hive2/lib/hive-hcatalog-core.jar to dest: hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/b89aaebf-aeaa-4b76-974a-7047592a186b-resources/hive-hcatalog-core.jar
> 2018-06-18T19:06:50,467 INFO [Workload management worker 0] tez.DagUtils: Resource modification time: 1529348810466 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/b89aaebf-aeaa-4b76-974a-7047592a186b-resources/hive-hcatalog-core.jar
> 2018-06-18T19:06:50,467 INFO [Workload management worker 0] tez.DagUtils: Localizing resource because it does not exist: file:/usr/hdp/3.0.0.0-1064/hive2/auxlib/alluxio-core-client-runtime.jar to dest: hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/b89aaebf-aeaa-4b76-974a-7047592a186b-resources/alluxio-core-client-runtime.jar
> 2018-06-18T19:06:50,504 INFO [Thread-250] monitoring.RenderStrategy$LogToFileFunction: Map 1: 361(+0)/641
> 2018-06-18T19:06:50,785 INFO [Workload management worker 0] tez.DagUtils: Resource modification time: 1529348810784 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/tmp/hive/hive/_tez_session_dir/b89aaebf-aeaa-4b76-974a-7047592a186b-resources/alluxio-core-client-runtime.jar
> 2018-06-18T19:06:50,785 INFO [Workload management worker 0] tez.TezSessionState: Created new resources: null
> 2018-06-18T19:06:50,786 INFO [Workload management worker 0] tez.DagUtils: Jar dir is null / directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/hive/.hiveJars
> 2018-06-18T19:06:50,788 INFO [Workload management worker 0] tez.DagUtils: Resource modification time: 1525405767020 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/user/hive/.hiveJars/hive-exec-3.1.0-SNAPSHOT-7db4045ca6fe10361ffcde371b4327bc911d68f174562dd75f00abce0c42fa36.jar
> 2018-06-18T19:06:50,791 INFO [Workload management worker 0] tez.DagUtils: Jar dir is null / directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/hive/.hiveJars
> 2018-06-18T19:06:50,792 INFO [Workload management worker 0] tez.DagUtils: Resource modification time: 1525405767036 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/user/hive/.hiveJars/hive-llap-tez-3.1.0-SNAPSHOT-1530a700e8bc86603cf11cd8d0a2518f21a7a7c2ac09eb3af7782017ec05c4d3.jar
> 2018-06-18T19:06:50,792 INFO [Workload management worker 0] tez.DagUtils: Jar dir is null / directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/hive/.hiveJars
> 2018-06-18T19:06:50,793 INFO [Workload management worker 0] tez.DagUtils: Resource modification time: 1525405767020 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/user/hive/.hiveJars/hive-exec-3.1.0-SNAPSHOT-7db4045ca6fe10361ffcde371b4327bc911d68f174562dd75f00abce0c42fa36.jar
> 2018-06-18T19:06:50,794 INFO [Workload management worker 0] tez.DagUtils: Jar dir is null / directory doesn't exist. Choosing HIVE_INSTALL_DIR - /user/hive/.hiveJars
> 2018-06-18T19:06:50,796 INFO [Thread-250] SessionState: Status: Killed
> 2018-06-18T19:06:50,796 ERROR [Thread-250] SessionState: Dag received [DAG_TERMINATE, DAG_KILL] in RUNNING state.
> 2018-06-18T19:06:50,796 ERROR [Thread-250] SessionState: Received message to shutdown AM from hive (auth:SIMPLE) at 10.8.101.64
> 2018-06-18T19:06:50,796 ERROR [Thread-250] SessionState: Vertex killed, vertexName=Map 1, vertexId=vertex_1528322657674_0427_1_00, diagnostics=[Vertex received Kill while in RUNNING state., Vertex did not succeed due to DAG_TERMINATED, failedTasks:0 killedTasks:280, Vertex vertex_1528322657674_0427_1_00 [Map 1] killed/failed due to:DAG_TERMINATED]
> 2018-06-18T19:06:50,796 ERROR [Thread-250] SessionState: DAG did not succeed due to DAG_KILL. failedVertices:0 killedVertices:1
> 2018-06-18T19:06:50,797 INFO [Workload management master] tez.WorkloadManager: Processing current events
> 2018-06-18T19:06:50,797 INFO [Workload management worker 0] tez.DagUtils: Resource modification time: 1525405767020 for hdfs://ip-10-8-101-64.data.bazaarvoice.com:8020/user/hive/.hiveJars/hive-exec-3.1.0-SNAPSHOT-7db4045ca6fe10361ffcde371b4327bc911d68f174562dd75f00abce0c42fa36.jar
> 2018-06-18T19:06:50,797 INFO [Workload management master] tez.WorkloadManager: Returning sessionId=49be39e5-875c-4cfe-8601-7fe84dd57e0c, queueName=llap, user=hive, doAs=false, isOpen=false, isDefault=true, expires in 586276558ms, WM state poolName=null, clusterFraction=null, queryId=null, killReason=Failed to update resource allocation
> {noformat}


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
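For readers reproducing the setup: the MOVE action in the log comes from a workload management trigger attached to an active resource plan. Below is a minimal sketch of the kind of HiveQL that defines such a trigger; only the trigger name, expression, target pool, and cluster fraction (0.3) are taken from the log above, while the plan name (llap_plan) and the pool's query parallelism are illustrative assumptions, not the reporter's actual configuration.

{noformat}
-- Hedged sketch: a resource plan with a "medium" pool and an alluxio_medium
-- trigger like the one in the log. Plan name and pool settings are illustrative.
CREATE RESOURCE PLAN llap_plan;

-- Pool receiving ~30% of the cluster, matching clusterFraction=0.3 / Cluster %: 30.0 in the log.
CREATE POOL llap_plan.medium WITH ALLOC_FRACTION = 0.3, QUERY_PARALLELISM = 2;

-- Move a query out of the default pool once it has read more than ~6 GB from Alluxio.
CREATE TRIGGER llap_plan.alluxio_medium
  WHEN ALLUXIO_BYTES_READ > 6442450944 DO MOVE TO medium;
ALTER POOL llap_plan.default ADD TRIGGER alluxio_medium;

-- Enable and activate the plan so WM starts evaluating the trigger.
ALTER RESOURCE PLAN llap_plan ENABLE;
ALTER RESOURCE PLAN llap_plan ACTIVATE;
{noformat}

When the trigger fires, WM moves the session to the medium pool and recomputes its guaranteed task count; in this report, that update RPC to the LLAP plugin endpoint failed with the NullPointerException shown above, and WM killed and replaced the session.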