[ https://issues.apache.org/jira/browse/HIVE-27899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17788971#comment-17788971 ]

Sungwoo Park commented on HIVE-27899:
-------------------------------------

Calling canCommit() may not be a complete solution. For example, could we still 
hit a bad scenario like this?

TaskAttempt#1 calls canCommit(), writes its output, and then fails for some 
reason. Later, TaskAttempt#2 calls canCommit(), writes its output, and completes 
successfully. Both output files are now sitting in the temporary directory.
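
To make the concern concrete, here is a minimal, self-contained Java sketch of 
that sequence (the class, method, and file names are hypothetical, not Hive/Tez 
code). The AM-side commit permission is modelled as a plain AtomicReference that 
is freed when its holder dies, so it can be granted a second time while the first 
attempt's output file is still sitting in the temporary directory.

import java.nio.file.*;
import java.util.concurrent.atomic.AtomicReference;

public class CanCommitGapDemo {
  // Simulates the AM side: at most one live attempt holds the commit permission.
  static final AtomicReference<String> grantedTo = new AtomicReference<>();

  static boolean canCommit(String attempt) {
    // Grant if nobody holds the permission, or if this attempt already holds it.
    return grantedTo.compareAndSet(null, attempt) || attempt.equals(grantedTo.get());
  }

  static void attemptFailed(String attempt) {
    // The AM learns the attempt died and frees the permission for a re-run.
    grantedTo.compareAndSet(attempt, null);
  }

  public static void main(String[] args) throws Exception {
    Path tmpDir = Files.createTempDirectory("task_tmp");

    // TaskAttempt#1: permission granted, output written, then the attempt dies.
    if (canCommit("attempt_1")) {
      Files.writeString(tmpDir.resolve("000000_0_attempt_1"), "rows...");
      attemptFailed("attempt_1");   // crash happens before success is reported
    }

    // TaskAttempt#2: permission granted again, writes its own output, succeeds.
    if (canCommit("attempt_2")) {
      Files.writeString(tmpDir.resolve("000000_0_attempt_2"), "rows...");
    }

    // Both files are now present, so canCommit alone does not spare the
    // downstream cleanup/dedup step from having to handle the stale file.
    try (var files = Files.list(tmpDir)) {
      files.forEach(System.out::println);
    }
  }
}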


> Killed speculative execution task attempt should not commit file
> ----------------------------------------------------------------
>
>                 Key: HIVE-27899
>                 URL: https://issues.apache.org/jira/browse/HIVE-27899
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>         Attachments: reproduce_bug.md
>
>
> As I mentioned in HIVE-25561, when Tez speculative execution is enabled, the 
> data files produced by Hive may be duplicated. HIVE-25561 addressed the case 
> where a killed speculative task attempt commits data unexpectedly. However, one 
> situation is still not solved: if two task attempts commit their files at the 
> same time, duplicate data files can also appear. The probability of this 
> happening is very low, but it does happen.
>  
> Why?
> There are two key steps; a self-contained sketch illustrating both follows the 
> discussion of step (2) below.
> (1) FileSinkOperator::closeOp
> TezProcessor::initializeAndRunProcessor --> ... --> FileSinkOperator::closeOp 
> --> fsp.commit
> When the operator is closed, the close path eventually triggers the call to 
> fsp.commit, which places the attempt's output file in the temporary directory.
> (2) removeTempOrDuplicateFiles
> (2.a) First, listStatus the files in the temporary directory.
> (2.b) Then check whether there were multiple or incorrect commits, and finally 
> move the correct results to the final directory.
> When speculative execution is enabled and one attempt of a task completes, the 
> other attempts are killed. However, the AM only sends the kill event; it does 
> not wait for all cleanup actions to finish, so a killed attempt's closeOp may 
> still run between 2.a and 2.b. In that case removeTempOrDuplicateFiles never 
> sees, and therefore never deletes, the file generated by the killed attempt.
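
Here is a minimal, self-contained sketch of the window between 2.a and 2.b (local 
temp directories and hypothetical file names, not the actual FileSinkOperator or 
removeTempOrDuplicateFiles code). The killed attempt's commit, modelled as a plain 
rename into the scanned directory, lands after the listing has been taken, so the 
extra file is never inspected.

import java.nio.file.*;
import java.util.List;

public class DedupRaceDemo {
  public static void main(String[] args) throws Exception {
    Path taskTmp  = Files.createTempDirectory("_task_tmp"); // private to the killed attempt
    Path tmpDir   = Files.createTempDirectory("_tmp");      // directory the dedup pass scans
    Path finalDir = Files.createTempDirectory("final");     // final location

    // The successful attempt has already committed its file into the scanned dir,
    // while the killed attempt has written its output but not yet run closeOp.
    Files.writeString(tmpDir.resolve("000000_1"), "rows from attempt_1");
    Path lateFile = Files.writeString(taskTmp.resolve("000000_2"), "rows from attempt_2");

    // 2.a: the dedup pass takes its listing of the temporary directory now ...
    List<Path> snapshot;
    try (var s = Files.list(tmpDir)) { snapshot = s.toList(); }

    // ... and only now does the killed attempt's closeOp finish; fsp.commit is
    // modelled here as a rename into the scanned directory.
    Files.move(lateFile, tmpDir.resolve("000000_2"));

    // 2.b: duplicate checking and the move to the final directory only look at
    // the stale snapshot, so the late file is neither deleted nor deduplicated.
    for (Path p : snapshot) Files.move(p, finalDir.resolve(p.getFileName()));

    try (var s = Files.list(tmpDir)) {
      s.forEach(p -> System.out.println("left behind, never checked: " + p));
    }
  }
}

In the real job the two sides run in different processes, but the ordering is the 
same: the listing is taken by the dedup pass while the killed attempt's closeOp is 
still racing to commit.
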
> How?
> The problem is that both speculatively executed task attempts commit their 
> files. This does not happen in the Tez examples, because they call canCommit 
> first, which guarantees that one and only one task attempt commits 
> successfully. If one task attempt passes canCommit, the other stays blocked in 
> canCommit until it receives a kill signal.
> For details, see: 
> [https://github.com/apache/tez/blob/51d6f53967110e2b91b6d90b46f8e16bdc062091/tez-mapreduce/src/main/java/org/apache/tez/mapreduce/processor/SimpleMRProcessor.java#L70]
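
For contrast, here is a minimal, self-contained simulation of the guarantee the 
canCommit gate is meant to provide (the names are hypothetical; this is not the 
Tez code linked above). The first attempt to ask is granted, and the second 
attempt blocks inside canCommit until it is interrupted, which stands in for the 
kill signal sent by the AM.

import java.util.concurrent.CompletableFuture;

public class CanCommitGateDemo {
  // Simulates the AM: the first attempt to ask becomes the only one allowed to commit.
  static final CompletableFuture<String> winner = new CompletableFuture<>();

  static boolean canCommit(String attempt) {
    winner.complete(attempt);                  // no-op if a winner already exists
    try {
      while (!attempt.equals(winner.get())) {
        Thread.sleep(100);                     // the losing attempt waits here ...
      }
      return true;
    } catch (Exception e) {
      return false;                            // ... until it is interrupted ("killed")
    }
  }

  public static void main(String[] args) throws Exception {
    if (canCommit("attempt_1")) {
      System.out.println("attempt_1 is allowed to commit");
    }
    Thread attempt2 = new Thread(() -> {
      if (canCommit("attempt_2")) System.out.println("attempt_2 committed (should not happen)");
      else System.out.println("attempt_2 was killed before it could commit");
    });
    attempt2.start();
    Thread.sleep(500);
    attempt2.interrupt();                      // stands in for the AM's kill signal
    attempt2.join();
  }
}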



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
