[jira] [Commented] (HIVE-27985) Avoid duplicate files.

Chenyu Zheng (Jira) Tue, 19 Mar 2024 23:35:30 -0700


    [ 
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828613#comment-17828613
 ]


Chenyu Zheng commented on HIVE-27985:
-------------------------------------

[~abstractdog] [~jfs] [~pvary] [~kuczoram] [~nareshpr] 

Hi, could you please review this proposal? I think it can fundamentally 
solve the file duplication problem for Hive on Tez, after which we can 
enable speculative execution.

> Avoid duplicate files.
> ----------------------
>
>                 Key: HIVE-27985
>                 URL: https://issues.apache.org/jira/browse/HIVE-27985
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>         Attachments: how tez examples commit.png
>
>
> *1 Introduction*
> Hive on Tez occasionally produces duplicate files, especially when 
> speculative execution is enabled. Hive identifies and removes duplicate 
> files through removeTempOrDuplicateFiles. However, this logic often does not 
> take effect. For example, a killed task attempt may commit files while this 
> method is running, or the files under HIVE_UNION_SUBDIR_X may not be 
> recognized during union all. Many issues have been filed to address these 
> problems, mainly focusing on how to identify duplicate files. *This issue 
> instead solves the problem by avoiding the generation of duplicate files.*
> *2 How does Tez avoid duplicate files?*
> After testing, I found that the Hadoop MapReduce examples and the Tez 
> examples do not have this problem. With a properly designed OutputCommitter, 
> duplicate files can be avoided. Let's analyze how Tez avoids duplicate files.
> _Note: Compared with Tez, Hadoop MapReduce has one additional commitPending 
> step, which is not critical here, so only Tez is analyzed._
> !how tez examples commit.png|width=778,height=483!
>  
> Let’s analyze these steps:
>  * (1) {*}process records{*}: The task processes its records.
>  * (2) {*}send canCommit request{*}: After all records are processed, the 
> task calls canCommit remotely on the AM.
>  * (3) {*}update commitAttempt{*}: When the AM receives the canCommit 
> request, it checks whether another task attempt of the same task has 
> already called canCommit. If no other task attempt called canCommit first, 
> it returns true. Otherwise it returns false. This ensures that only one 
> task attempt commits for each task.
>  * (4) {*}return canCommit response{*}: The task receives the AM's response. 
> If it is true, the task attempt may commit. If it is false, another task 
> attempt has already been granted the commit, and this one cannot commit. It 
> goes back to (2) and keeps calling canCommit until it is killed or the 
> other task attempt fails.
>  * (5) {*}output.commit{*}: Execute the commit, specifically rename the 
> generated temporary file to the final file.
>  * (6) {*}notify succeeded{*}: Although the task attempt has produced the 
> final file, the AM still needs to be told that the work is complete. The 
> task attempt therefore reports its completion to the AM through a heartbeat.
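
The arbitration in steps (2) to (4) can be illustrated with a small sketch. This is not Tez's actual code: CommitArbiter, its map, and the release method are hypothetical names, and the release-on-failure behaviour is an assumption inferred from the remark that a refused attempt keeps asking until it is killed or the other attempt fails.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

/** Hypothetical stand-in for the AM-side bookkeeping of step (3); not a Tez class. */
class CommitArbiter {
    // taskId -> the attempt that has been granted the right to commit
    private final ConcurrentMap<String, Integer> commitAttempt = new ConcurrentHashMap<>();

    /**
     * Returns true for the first attempt of a task that asks, and for that
     * same attempt if it asks again. Every other attempt gets false.
     */
    boolean canCommit(String taskId, int attemptId) {
        Integer winner = commitAttempt.putIfAbsent(taskId, attemptId);
        return winner == null || winner == attemptId;
    }

    /**
     * Assumed behaviour: when the granted attempt fails or is killed, its
     * grant is dropped so that a waiting attempt can be granted instead.
     */
    void release(String taskId, int attemptId) {
        commitAttempt.remove(taskId, attemptId);
    }
}
```

With this sketch, attempt 0 of a task is granted the commit, a speculative attempt 1 is refused for as long as attempt 0 holds the grant, and attempt 1 is granted only after release is called for attempt 0.
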
> There is a problem in the above steps. If an exception occurs in the task 
> after (5) and before (6), the AM does not know that the task attempt has 
> completed, so it starts a new task attempt. The new task attempt generates 
> a new file, which could cause duplication. I added code that randomly 
> throws exceptions between (5) and (6), and found that the Tez example 
> still did not produce duplicate data. Why? Mainly because the final file 
> name is the same no matter which task attempt generates it. When a new 
> task attempt commits and finds that the final file already exists (it was 
> generated by the previous task attempt), it first deletes that file and 
> then renames its own. Regardless of whether the previous task attempt 
> committed normally, the last successful task attempt clears the previous 
> erroneous results.
> To summarize, tez-examples uses two methods to avoid duplicate files:
>  * (1) Avoid repeated commits through canCommit. This is particularly 
> effective for tasks with speculative execution turned on.
>  * (2) The final file names generated by different task attempts are the 
> same. Combined with canCommit, this guarantees that only one file is 
> generated in the end, and only by a successful task attempt.
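
Method (2) above, committing to a final name that does not depend on the attempt, can be sketched as follows. This is a minimal illustration on the local file system using java.nio.file, not the Tez or Hadoop committer itself. SameNameCommitter, writeTemp, commit, and the "_tmp." prefix are hypothetical names chosen for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical sketch of "delete first, then rename" with an attempt-independent final name. */
class SameNameCommitter {

    /** Each attempt writes to its own temporary file, e.g. dir/_tmp.000000_1. */
    static Path writeTemp(Path dir, String taskId, int attemptId, String data) {
        try {
            Path tmp = dir.resolve("_tmp." + taskId + "_" + attemptId);
            return Files.write(tmp, data.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Commit: the final name is derived from the task only. If an earlier
     * attempt already left a final file behind, it is deleted before the
     * rename, so at most one final file per task remains.
     */
    static Path commit(Path tmp, Path dir, String taskId) {
        try {
            Path finalFile = dir.resolve(taskId);
            Files.deleteIfExists(finalFile);
            return Files.move(tmp, finalFile);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

If attempt 0 commits and then dies before notifying the AM, a later commit by attempt 1 replaces its file, leaving a single file named after the task. On HDFS the corresponding calls would be FileSystem.delete and FileSystem.rename. The delete followed by the rename is not atomic, which is one reason this has to be combined with canCommit.
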
> *3 Why can't Hive on Tez avoid duplicate files?*
> Hive on Tez has neither of the two mechanisms used by the Tez examples.
> First, Hive on Tez does not call canCommit. TezProcessor inherits from 
> AbstractLogicalIOProcessor, whereas the canCommit logic in the Tez examples 
> is mainly in SimpleMRProcessor.
> Second, the file names generated by different attempts under Hive on Tez 
> are not the same. The file generated by the first attempt of a task is 
> 000000_0, and the file generated by the second attempt is 000000_1.
> *4 How to improve?*
> Use canCommit to ensure that speculative task attempts do not commit at the 
> same time. (HIVE-27899)
> Make the different task attempts of each task generate the same final file 
> name. (HIVE-27986)
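
As an illustration of the second improvement, the sketch below maps the attempt-specific names from section 3 (000000_0, 000000_1) to one attempt-independent name. It is not the HIVE-27986 patch. AttemptIndependentName and finalName are hypothetical, and the sketch assumes the simple <task>_<attempt> form quoted in the description, whereas real Hive file names can carry additional components.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: derive a final name that ignores the attempt number. */
class AttemptIndependentName {
    // Assumes names of the form <task digits>_<attempt digits>, e.g. 000000_1.
    private static final Pattern TASK_ATTEMPT = Pattern.compile("^(\\d+)_(\\d+)$");

    static String finalName(String attemptFileName) {
        Matcher m = TASK_ATTEMPT.matcher(attemptFileName);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a <task>_<attempt> name: " + attemptFileName);
        }
        return m.group(1); // keep the task part, drop the attempt part
    }
}
```

Under this assumption, 000000_0 and 000000_1 both map to 000000, so a later attempt's commit targets the same path as an earlier one instead of adding a second file.
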



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
