[ 
https://issues.apache.org/jira/browse/HIVE-28962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denys Kuzmenko updated HIVE-28962:
----------------------------------
    Description: 
Communication failure between coordinator and executor lead to a task kill 
(06:04:18.328Z)
 
d2/app=query-coordinator-0/2025-05-16-06-00_query-coordinator_query-coordinator-0-7_638e93f4-c948-4fc0-ad05-1451fc4b7c53_0.log:<14>1
 2025-05-16T06:04:18.328Z query-coordinator-0-7 query-coordinator 1 
638e93f4-c948-4fc0-ad05-1451fc4b7c53 [mdc@38374 
class="HistoryEventHandler.criticalEvents" level="INFO" thread="Dispatcher 
thread \{Central}"] 
[HISTORY][DAG:dag_1746256596566_0014_1634][Event:TASK_ATTEMPT_FINISHED]: 
vertexName=Reducer 3, 
taskAttemptId=attempt_1746256596566_0014_1634_07_000070_13, 
creationTime=1747375458134, allocationTime=1747375458277, 
startTime=1747375458294, finishTime=1747375458328, timeTaken=34, status=KILLED, 
errorEnum=NODE_FAILED, diagnostics=Node with same host and port but with new 
unique ID pinged, 
nodeHttpAddress=[http://query-executor-0-16.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25002|http://query-executor-0-16.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25002/],
 counters=Counters: 1, org.apache.tez.common.counters.TaskCounter, 
TASK_DURATION_MILLIS=34
in the meantime, the executor kept on running the task attempt (06:04:27.340Z)
 
<14>1 2025-05-16T06:04:27.340Z query-executor-0-16 query-executor 1 
71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 
class="task.TaskRunner2Callable" dagId="dag_1746256596566_0014_1634" 
fragmentId="1746256596566_0014_1634_07_000070_13" level="INFO" 
queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5" 
thread="TezTR-596566_14_1634_7_70_13"] Initializing task, 
taskAttemptId=attempt_1746256596566_0014_1634_07_000070_13
but attempt=28 was started by the coordinator right after it noticed that 
attempt=13 failed (06:04:20.153Z)
 
d2/app=query-coordinator-0/2025-05-16-06-00_query-coordinator_query-coordinator-0-7_638e93f4-c948-4fc0-ad05-1451fc4b7c53_0.log:<14>1
 2025-05-16T06:04:20.153Z query-coordinator-0-7 query-coordinator 1 
638e93f4-c948-4fc0-ad05-1451fc4b7c53 [mdc@38374 
class="tezplugins.LlapTaskSchedulerService" level="INFO" 
thread="LlapScheduler"] Assigned #16318391, 
task=TaskInfo\{task=attempt_1746256596566_0014_1634_07_000070_28, priority=140, 
startTime=0, containerId=null, uniqueId=16318390, 
localityDelayTimeout=1119114761} on 
node=\{query-executor-0-1.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25000,
 id=2c29ab22-c544-438e-bddf-dfbaebaa29fb, canAcceptTask=true, st=17, ac=5, 
commF=false, disabled=false}, to container=container_111101111_0014_01_32636783
and even if it was successfully killed, InterruptedException is typically 
because of task attempt kill in this scenario (06:04:27.442Z)
 
<11>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1 
71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 class="tez.TezProcessor" 
dagId="dag_1746256596566_0014_1634" 
fragmentId="1746256596566_0014_1634_07_000070_13" level="ERROR" 
queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5" 
thread="TezTR-596566_14_1634_7_70_13"] java.lang.InterruptedException\r at 
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)\r
 at 
java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2090)\r
 at 
org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:147)\r
 at 
org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:107)\r
 at 
org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:138)\r
 at 
org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:133)\r
 at 
org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:122)\r
 at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)\r
 at 
org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)\r at 
org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)\r
 at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:86)\r
 at 
org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:72)\r
 at java.base/java.security.AccessController.doPrivileged(Native Method)\r at 
java.base/javax.security.auth.Subject.doAs(Subject.java:423)\r at 
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)\r
 at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:72)\r
 at 
org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:42)\r
 at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)\r at 
org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)\r
 at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\r at 
java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r
 at 
java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r
 at java.base/java.lang.Thread.run(Thread.java:829)\r
it already written a commit manifest before (06:04:27.442Z) <-- same timestamp, 
but before
 
<14>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1 
71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 
class="hive.HiveIcebergOutputCommitter" dagId="dag_1746256596566_0014_1634" 
fragmentId="1746256596566_0014_1634_07_000070_13" level="INFO" 
queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5" 
thread="TezTR-596566_14_1634_7_70_13"] Created Iceberg commitTask manifest 
file: 
[s3a://s3-cdp-dub-prd-int-prd-464373096835/data/warehouse/tablespace/external/hive/adp_staging.db/fct_position_security_level_ext_2/temp/hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5-job_17462565965667_0014/task-70.forCommit]\rFilesForCommit\{dataFiles=[],
 deleteFiles=[], replacedDataFiles=[], referencedDataFiles=[]}

> Killed Tez task should be aborted
> ---------------------------------
>
>                 Key: HIVE-28962
>                 URL: https://issues.apache.org/jira/browse/HIVE-28962
>             Project: Hive
>          Issue Type: Bug
>          Components: Iceberg integration
>            Reporter: Denys Kuzmenko
>            Priority: Major
>
> Communication failure between coordinator and executor lead to a task kill 
> (06:04:18.328Z)
>  
> d2/app=query-coordinator-0/2025-05-16-06-00_query-coordinator_query-coordinator-0-7_638e93f4-c948-4fc0-ad05-1451fc4b7c53_0.log:<14>1
>  2025-05-16T06:04:18.328Z query-coordinator-0-7 query-coordinator 1 
> 638e93f4-c948-4fc0-ad05-1451fc4b7c53 [mdc@38374 
> class="HistoryEventHandler.criticalEvents" level="INFO" thread="Dispatcher 
> thread \{Central}"] 
> [HISTORY][DAG:dag_1746256596566_0014_1634][Event:TASK_ATTEMPT_FINISHED]: 
> vertexName=Reducer 3, 
> taskAttemptId=attempt_1746256596566_0014_1634_07_000070_13, 
> creationTime=1747375458134, allocationTime=1747375458277, 
> startTime=1747375458294, finishTime=1747375458328, timeTaken=34, 
> status=KILLED, errorEnum=NODE_FAILED, diagnostics=Node with same host and 
> port but with new unique ID pinged, 
> nodeHttpAddress=[http://query-executor-0-16.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25002|http://query-executor-0-16.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25002/],
>  counters=Counters: 1, org.apache.tez.common.counters.TaskCounter, 
> TASK_DURATION_MILLIS=34
> in the meantime, the executor kept on running the task attempt (06:04:27.340Z)
>  
> <14>1 2025-05-16T06:04:27.340Z query-executor-0-16 query-executor 1 
> 71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 
> class="task.TaskRunner2Callable" dagId="dag_1746256596566_0014_1634" 
> fragmentId="1746256596566_0014_1634_07_000070_13" level="INFO" 
> queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5" 
> thread="TezTR-596566_14_1634_7_70_13"] Initializing task, 
> taskAttemptId=attempt_1746256596566_0014_1634_07_000070_13
> but attempt=28 was started by the coordinator right after it noticed that 
> attempt=13 failed (06:04:20.153Z)
>  
> d2/app=query-coordinator-0/2025-05-16-06-00_query-coordinator_query-coordinator-0-7_638e93f4-c948-4fc0-ad05-1451fc4b7c53_0.log:<14>1
>  2025-05-16T06:04:20.153Z query-coordinator-0-7 query-coordinator 1 
> 638e93f4-c948-4fc0-ad05-1451fc4b7c53 [mdc@38374 
> class="tezplugins.LlapTaskSchedulerService" level="INFO" 
> thread="LlapScheduler"] Assigned #16318391, 
> task=TaskInfo\{task=attempt_1746256596566_0014_1634_07_000070_28, 
> priority=140, startTime=0, containerId=null, uniqueId=16318390, 
> localityDelayTimeout=1119114761} on 
> node=\{query-executor-0-1.query-executor-0-service.compute-1729926174-gnrw.svc.cluster.local:25000,
>  id=2c29ab22-c544-438e-bddf-dfbaebaa29fb, canAcceptTask=true, st=17, ac=5, 
> commF=false, disabled=false}, to 
> container=container_111101111_0014_01_32636783
> and even if it was successfully killed, InterruptedException is typically 
> because of task attempt kill in this scenario (06:04:27.442Z)
>  
> <11>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1 
> 71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 class="tez.TezProcessor" 
> dagId="dag_1746256596566_0014_1634" 
> fragmentId="1746256596566_0014_1634_07_000070_13" level="ERROR" 
> queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5" 
> thread="TezTR-596566_14_1634_7_70_13"] java.lang.InterruptedException\r at 
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)\r
>  at 
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2090)\r
>  at 
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:147)\r
>  at 
> org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:107)\r
>  at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:138)\r
>  at 
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:133)\r
>  at 
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:122)\r
>  at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)\r
>  at 
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)\r 
> at 
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)\r
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:86)\r
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:72)\r
>  at java.base/java.security.AccessController.doPrivileged(Native Method)\r at 
> java.base/javax.security.auth.Subject.doAs(Subject.java:423)\r at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)\r
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:72)\r
>  at 
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:42)\r
>  at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)\r at 
> org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)\r
>  at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\r at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r
>  at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r
>  at java.base/java.lang.Thread.run(Thread.java:829)\r
> it already written a commit manifest before (06:04:27.442Z) <-- same 
> timestamp, but before
>  
> <14>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1 
> 71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 
> class="hive.HiveIcebergOutputCommitter" dagId="dag_1746256596566_0014_1634" 
> fragmentId="1746256596566_0014_1634_07_000070_13" level="INFO" 
> queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5" 
> thread="TezTR-596566_14_1634_7_70_13"] Created Iceberg commitTask manifest 
> file: 
> [s3a://s3-cdp-dub-prd-int-prd-464373096835/data/warehouse/tablespace/external/hive/adp_staging.db/fct_position_security_level_ext_2/temp/hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5-job_17462565965667_0014/task-70.forCommit]\rFilesForCommit\{dataFiles=[],
>  deleteFiles=[], replacedDataFiles=[], referencedDataFiles=[]}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to