[
https://issues.apache.org/jira/browse/HIVE-28962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17954394#comment-17954394
]
Denys Kuzmenko commented on HIVE-28962:
---------------------------------------
Merged to master
Thanks [~simhadri-g] and [~abstractdog] for the review!
> Prevent committing outputs if an exception was thrown in the Tez processor
> --------------------------------------------------------------------------
>
> Key: HIVE-28962
> URL: https://issues.apache.org/jira/browse/HIVE-28962
> Project: Hive
> Issue Type: Bug
> Components: Iceberg integration
> Reporter: Denys Kuzmenko
> Assignee: Denys Kuzmenko
> Priority: Major
> Labels: pull-request-available
>
> Communication failure between coordinator and executor led to a task kill
> (06:04:18.328Z)
> {code:java}
> app=query-coordinator-0/2025-05-16-06-00_query-coordinator_query-coordinator-0-7_638e93f4-c948-4fc0-ad05-1451fc4b7c53_0.log:<14>1
> 2025-05-16T06:04:18.328Z query-coordinator-0-7 query-coordinator 1
> 638e93f4-c948-4fc0-ad05-1451fc4b7c53 [mdc@38374
> class="HistoryEventHandler.criticalEvents" level="INFO" thread="Dispatcher
> thread {Central}"]
> [HISTORY][DAG:dag_1746256596566_0014_1634][Event:TASK_ATTEMPT_FINISHED]:
> vertexName=Reducer 3,
> taskAttemptId=attempt_1746256596566_0014_1634_07_000070_13,
> creationTime=1747375458134, allocationTime=1747375458277,
> startTime=1747375458294, finishTime=1747375458328, timeTaken=34,
> status=KILLED, errorEnum=NODE_FAILED, diagnostics=Node with same host and
> port but with new unique ID pinged, nodeHttpAddress=<>, counters=Counters: 1,
> org.apache.tez.common.counters.TaskCounter, TASK_DURATION_MILLIS=34
> {code:java}
> {code}
> then, an InterruptedException was thrown by `ReduceRecordProcessor.init()`
> because of task attempt kill (06:04:27.442Z)
> {code:java}
> <11>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1
> 71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374 class="tez.TezProcessor"
> dagId="dag_1746256596566_0014_1634"
> fragmentId="1746256596566_0014_1634_07_000070_13" level="ERROR"
> queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5"
> thread="TezTR-596566_14_1634_7_70_13"] java.lang.InterruptedException\r at
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2056)\r
> at
> java.base/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2090)\r
> at
> org.apache.tez.runtime.InputReadyTracker$InputReadyMonitor.awaitCondition(InputReadyTracker.java:147)\r
> at
> org.apache.tez.runtime.InputReadyTracker.waitForAllInputsReady(InputReadyTracker.java:107)\r
> at
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:138)\r
> at
> org.apache.tez.runtime.api.impl.TezProcessorContextImpl.waitForAllInputsReady(TezProcessorContextImpl.java:133)\r
> at
> org.apache.hadoop.hive.ql.exec.tez.ReduceRecordProcessor.init(ReduceRecordProcessor.java:122)\r
> at
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.initializeAndRunProcessor(TezProcessor.java:296)\r
> at
> org.apache.hadoop.hive.ql.exec.tez.TezProcessor.run(TezProcessor.java:280)\r
> at
> org.apache.tez.runtime.LogicalIOProcessorRuntimeTask.run(LogicalIOProcessorRuntimeTask.java:374)\r
> at
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:86)\r
> at
> org.apache.tez.runtime.task.TaskRunner2Callable$1.run(TaskRunner2Callable.java:72)\r
> at java.base/java.security.AccessController.doPrivileged(Native Method)\r at
> java.base/javax.security.auth.Subject.doAs(Subject.java:423)\r at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)\r
> at
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:72)\r
> at
> org.apache.tez.runtime.task.TaskRunner2Callable.callInternal(TaskRunner2Callable.java:42)\r
> at org.apache.tez.common.CallableWithNdc.call(CallableWithNdc.java:36)\r at
> org.apache.hadoop.hive.llap.daemon.impl.StatsRecordingThreadPool$WrappedCallable.call(StatsRecordingThreadPool.java:118)\r
> at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)\r at
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\r
> at
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\r
> at java.base/java.lang.Thread.run(Thread.java:829)\r
> {code}
> however, it still allowed task to commit and created a commit manifest
> (06:04:27.442Z)
> {code:java}
>
> <14>1 2025-05-16T06:04:27.442Z query-executor-0-16 query-executor 1
> 71c82503-aeaa-4ca8-ad1c-c01db6df4f9d [mdc@38374
> class="hive.HiveIcebergOutputCommitter" dagId="dag_1746256596566_0014_1634"
> fragmentId="1746256596566_0014_1634_07_000070_13" level="INFO"
> queryId="hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5"
> thread="TezTR-596566_14_1634_7_70_13"] Created Iceberg commitTask manifest
> file:
> [s3a://temp/hive_20250516055906_b7f7cff1-fc84-44d2-a7c5-30bd9aa428c5-job_17462565965667_0014/task-70.forCommit]\rFilesForCommit\{dataFiles=[],
> deleteFiles=[], replacedDataFiles=[], referencedDataFiles=[]}
> {code}
> that is a regression introduced by HIVE-24857
--
This message was sent by Atlassian Jira
(v8.20.10#820010)