[
https://issues.apache.org/jira/browse/HIVE-28962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denys Kuzmenko updated HIVE-28962:
----------------------------------
Description:
Communication failure between coordinator and executor led to a task kill
(06:04:18.328Z)
{code:java}
app=query-coordinator-0/2025-05-16-06-00_query-coordinator_query-coordinator-0-7_638e93f4-c948-4fc0-ad05-1451fc4b7c53_0.log:<14>1
2025-05-16T06:04:18.328Z query-coordinator-0-7 query-coordinator 1
638e93f4-c948-4fc0-ad05-1451fc4b7c53 [mdc@38374
class="HistoryEventHandler.criticalEvents" level="INFO" thread="Dispatcher
thread
{Central}
"] [HISTORY][DAG:dag_1746256596566_0014_1634][Event:TASK_ATTEMPT_FINISHED]:
vertexName=Reducer 3,
taskAttemptId=attempt_1746256596566_0014_1634_07_000070_13,
creationTime=1747375458134, allocationTime=1747375458277,
startTime=1747375458294, finishTime=1747375458328, timeTaken=34, status=KILLED,
errorEnum=NODE_FAILED, diagnostics=Node with same host and port but with new
unique ID pinged,
nodeHttpAddress=[<|http://query-executor-0-16.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25002/]>,
counters=Counters: 1, org.apache.tez.common.counters.TaskCounter,
TASK_DURATION_MILLIS=34
{code:java}
{code}
an InterruptedException is thrown from TezProcessor.initializeAndRunProcessor
because of task attempt kill (06:04:27.442Z)
<11>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1
71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 class="tez.TezProcessor"
dagId="dag_1746256596566_0014_1634"
fragmentId="1746256596566_0014_1634_07_000070_13" level="ERROR"
queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5"
thread="TezTR-596566_14_1634_7_70_13"] java.lang.InterruptedException\r at
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)\r
at
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2090)\r
at
org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:147)\r
at
org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:107)\r
at
org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:138)\r
at
org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:133)\r
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:122)\r
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)\r
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)\r at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)\r
at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:86)\r
at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:72)\r
at java.base/java.security.AccessController.doPrivileged(Native Method)\r at
java.base/javax.security.auth.Subject.doAs(Subject.java:423)\r at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)\r
at
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:72)\r
at
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:42)\r
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)\r at
org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)\r
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\r at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r
at java.base/java.lang.Thread.run(Thread.java:829)\r
it already written a commit manifest before (06:04:27.442Z) <-- same timestamp,
but before
<14>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1
71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374
class="hive.HiveIcebergOutputCommitter" dagId="dag_1746256596566_0014_1634"
fragmentId="1746256596566_0014_1634_07_000070_13" level="INFO"
queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5"
thread="TezTR-596566_14_1634_7_70_13"] Created Iceberg commitTask manifest
file:
[s3a://s3-cdp-dub-prd-int-prd-464373096835/data/warehouse/tablespace/external/hive/adp_staging.db/fct_position_security_level_ext_2/temp/hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5-job_17462565965667_0014/task-70.forCommit]\rFilesForCommit\{dataFiles=[],
deleteFiles=[], replacedDataFiles=[], referencedDataFiles=[]}
was:
Communication failure between coordinator and executor lead to a task kill
(06:04:18.328Z)
{code}
app=query-coordinator-0/2025-05-16-06-00_query-coordinator_query-coordinator-0-7_638e93f4-c948-4fc0-ad05-1451fc4b7c53_0.log:<14>1
2025-05-16T06:04:18.328Z query-coordinator-0-7 query-coordinator 1
638e93f4-c948-4fc0-ad05-1451fc4b7c53 [mdc@38374
class="HistoryEventHandler.criticalEvents" level="INFO" thread="Dispatcher
thread
{Central}
"] [HISTORY][DAG:dag_1746256596566_0014_1634][Event:TASK_ATTEMPT_FINISHED]:
vertexName=Reducer 3,
taskAttemptId=attempt_1746256596566_0014_1634_07_000070_13,
creationTime=1747375458134, allocationTime=1747375458277,
startTime=1747375458294, finishTime=1747375458328, timeTaken=34, status=KILLED,
errorEnum=NODE_FAILED, diagnostics=Node with same host and port but with new
unique ID pinged,
nodeHttpAddress=[<|http://query-executor-0-16.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25002/]>,
counters=Counters: 1, org.apache.tez.common.counters.TaskCounter,
TASK_DURATION_MILLIS=34
{code:java}
{code}
an InterruptedException is thrown from TezProcessor.initializeAndRunProcessor
because of task attempt kill (06:04:27.442Z)
<11>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1
71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 class="tez.TezProcessor"
dagId="dag_1746256596566_0014_1634"
fragmentId="1746256596566_0014_1634_07_000070_13" level="ERROR"
queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5"
thread="TezTR-596566_14_1634_7_70_13"] java.lang.InterruptedException\r at
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)\r
at
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2090)\r
at
org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:147)\r
at
org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:107)\r
at
org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:138)\r
at
org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:133)\r
at
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:122)\r
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)\r
at
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)\r at
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)\r
at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:86)\r
at
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:72)\r
at java.base/java.security.AccessController.doPrivileged(Native Method)\r at
java.base/javax.security.auth.Subject.doAs(Subject.java:423)\r at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)\r
at
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:72)\r
at
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:42)\r
at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)\r at
org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)\r
at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\r at
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r
at
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r
at java.base/java.lang.Thread.run(Thread.java:829)\r
it already written a commit manifest before (06:04:27.442Z) <-- same timestamp,
but before
<14>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1
71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374
class="hive.HiveIcebergOutputCommitter" dagId="dag_1746256596566_0014_1634"
fragmentId="1746256596566_0014_1634_07_000070_13" level="INFO"
queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5"
thread="TezTR-596566_14_1634_7_70_13"] Created Iceberg commitTask manifest
file:
[s3a://s3-cdp-dub-prd-int-prd-464373096835/data/warehouse/tablespace/external/hive/adp_staging.db/fct_position_security_level_ext_2/temp/hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5-job_17462565965667_0014/task-70.forCommit]\rFilesForCommit\{dataFiles=[],
deleteFiles=[], replacedDataFiles=[], referencedDataFiles=[]}
> Killed Tez task should be aborted
> ---------------------------------
>
> Key: HIVE-28962
> URL: https://issues.apache.org/jira/browse/HIVE-28962
> Project: Hive
> Issue Type: Bug
> Components: Iceberg integration
> Reporter: Denys Kuzmenko
> Priority: Major
>
> Communication failure between coordinator and executor led to a task kill
> (06:04:18.328Z)
> {code:java}
> app=query-coordinator-0/2025-05-16-06-00_query-coordinator_query-coordinator-0-7_638e93f4-c948-4fc0-ad05-1451fc4b7c53_0.log:<14>1
> 2025-05-16T06:04:18.328Z query-coordinator-0-7 query-coordinator 1
> 638e93f4-c948-4fc0-ad05-1451fc4b7c53 [mdc@38374
> class="HistoryEventHandler.criticalEvents" level="INFO" thread="Dispatcher
> thread
> {Central}
> "] [HISTORY][DAG:dag_1746256596566_0014_1634][Event:TASK_ATTEMPT_FINISHED]:
> vertexName=Reducer 3,
> taskAttemptId=attempt_1746256596566_0014_1634_07_000070_13,
> creationTime=1747375458134, allocationTime=1747375458277,
> startTime=1747375458294, finishTime=1747375458328, timeTaken=34,
> status=KILLED, errorEnum=NODE_FAILED, diagnostics=Node with same host and
> port but with new unique ID pinged,
> nodeHttpAddress=[<|http://query-executor-0-16.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25002/]>,
> counters=Counters: 1, org.apache.tez.common.counters.TaskCounter,
> TASK_DURATION_MILLIS=34
> {code:java}
> {code}
> an InterruptedException is thrown from TezProcessor.initializeAndRunProcessor
> because of task attempt kill (06:04:27.442Z)
>
> <11>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1
> 71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 class="tez.TezProcessor"
> dagId="dag_1746256596566_0014_1634"
> fragmentId="1746256596566_0014_1634_07_000070_13" level="ERROR"
> queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5"
> thread="TezTR-596566_14_1634_7_70_13"] java.lang.InterruptedException\r at
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)\r
> at
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2090)\r
> at
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:147)\r
> at
> org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:107)\r
> at
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:138)\r
> at
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:133)\r
> at
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:122)\r
> at
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)\r
> at
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)\r
> at
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)\r
> at
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:86)\r
> at
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:72)\r
> at java.base/java.security.AccessController.doPrivileged(Native Method)\r at
> java.base/javax.security.auth.Subject.doAs(Subject.java:423)\r at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)\r
> at
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:72)\r
> at
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:42)\r
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)\r at
> org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)\r
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\r at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r
> at java.base/java.lang.Thread.run(Thread.java:829)\r
> it already written a commit manifest before (06:04:27.442Z) <-- same
> timestamp, but before
>
> <14>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1
> 71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374
> class="hive.HiveIcebergOutputCommitter" dagId="dag_1746256596566_0014_1634"
> fragmentId="1746256596566_0014_1634_07_000070_13" level="INFO"
> queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5"
> thread="TezTR-596566_14_1634_7_70_13"] Created Iceberg commitTask manifest
> file:
> [s3a://s3-cdp-dub-prd-int-prd-464373096835/data/warehouse/tablespace/external/hive/adp_staging.db/fct_position_security_level_ext_2/temp/hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5-job_17462565965667_0014/task-70.forCommit]\rFilesForCommit\{dataFiles=[],
> deleteFiles=[], replacedDataFiles=[], referencedDataFiles=[]}
--
This message was sent by Atlassian Jira
(v8.20.10#820010)