[ https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17828613#comment-17828613 ]
Chenyu Zheng commented on HIVE-27985:
-------------------------------------

[~abstractdog] [~jfs] [~pvary] [~kuczoram] [~nareshpr]

Hi, could you please review this proposal? I think it can fundamentally solve the problem of file duplication for Hive on Tez. Then we can enable speculative execution.

> Avoid duplicate files.
> ----------------------
>
>                 Key: HIVE-27985
>                 URL: https://issues.apache.org/jira/browse/HIVE-27985
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>         Attachments: how tez examples commit.png
>
>
> *1 Introduction*
> Hive on Tez occasionally produces duplicate files, especially when
> speculative execution is enabled. Hive identifies and removes duplicate
> files through removeTempOrDuplicateFiles. However, this logic often does
> not take effect. For example, a killed task attempt may commit files while
> this method is executing, or the files under HIVE_UNION_SUBDIR_X are not
> recognized during union all. There are many issues that address these
> problems, mainly focusing on how to identify duplicate files. *This issue
> instead solves the problem by avoiding the generation of duplicate files.*
> *2 How does Tez avoid duplicate files?*
> After testing, I found that the Hadoop MapReduce examples and the Tez
> examples do not have this problem. With a properly designed use of
> OutputCommitter, duplicate files can be avoided. Let's analyze how Tez
> avoids duplicate files.
> {color:#172b4d} _Note: Compared with Tez, Hadoop MapReduce has one extra
> step, commitPending, which is not critical, so only Tez is analyzed
> here._{color}
> !how tez examples commit.png|width=778,height=483!
>
> Let's analyze these steps:
> * (1) {*}process records{*}: Process records.
> * (2) {*}send canCommit request{*}: After all records are processed, call
> canCommit remotely on the AM.
> * (3) {*}update commitAttempt{*}: After the AM receives the canCommit
> request, it checks whether another task attempt of the same task has
> already called canCommit. If no other task attempt called canCommit first,
> it returns true. Otherwise it returns false. This ensures that only one
> task attempt is committed for each task.
> * (4) {*}return canCommit response{*}: The task attempt receives the AM's
> response. If it is true, the attempt may commit. If it is false, another
> task attempt has already been granted the commit, and this attempt cannot
> commit. It then loops back to (2) and keeps calling canCommit until it is
> killed or the other task attempt fails.
> * (5) {*}output.commit{*}: Execute the commit, specifically rename the
> generated temporary file to the final file.
> * (6) {*}notify succeeded{*}: Although the task attempt has produced the
> final file, the AM still needs to be told that the work is done. The AM is
> therefore notified through the heartbeat that the current task attempt has
> completed.
> There is a problem in the above steps. If an exception occurs in the task
> attempt after (5) and before (6), the AM does not know that the task
> attempt has completed, so the AM will still start a new task attempt, and
> the new task attempt will generate a new file, which causes duplication. I
> added code that randomly throws exceptions between (5) and (6), and found
> that the Tez example in fact did not produce duplicate data. Why? Mainly
> because the final file name is the same no matter which task attempt
> generates it. When a new task attempt commits and finds that the final file
> already exists (generated by the previous task attempt), it deletes that
> file first and then renames its own temporary file. Regardless of whether
> the previous task attempt committed normally, the last successful task
> attempt clears any earlier erroneous results.
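
The commit flow described above can be sketched as a small simulation. This is a toy model in Python, not Tez's Java API: `AppMaster`, `commit`, `run_task`, the `_tmp.` prefix, and the stable final name `000000` are all hypothetical names chosen for illustration. It also assumes that a failed attempt releases its commit grant so that a retry can be granted the commit. The sketch replays the failure case (attempt 0 commits, then crashes before notifying the AM) with and without attempt-independent final file names.

```python
import os
import tempfile


class AppMaster:
    """Toy stand-in for the AM's per-task commit arbitration (step 3)."""

    def __init__(self):
        self.commit_attempt = {}  # task_id -> attempt_id granted the commit

    def can_commit(self, task_id, attempt_id):
        # The first attempt to ask wins; other attempts of the task are refused.
        winner = self.commit_attempt.setdefault(task_id, attempt_id)
        return winner == attempt_id

    def attempt_failed(self, task_id, attempt_id):
        # Assumption: a failed attempt releases its grant so a retry can commit.
        if self.commit_attempt.get(task_id) == attempt_id:
            del self.commit_attempt[task_id]


def commit(out_dir, task_id, attempt_id, data, stable_name=True):
    """Step 5: write a temporary file, then rename it to the final name."""
    tmp = os.path.join(out_dir, f"_tmp.{task_id:06d}_{attempt_id}")
    with open(tmp, "w") as f:
        f.write(data)
    # Hypothetical naming: with stable_name the attempt id is left out.
    name = f"{task_id:06d}" if stable_name else f"{task_id:06d}_{attempt_id}"
    final = os.path.join(out_dir, name)
    if os.path.exists(final):
        os.remove(final)  # delete the earlier attempt's output first
    os.rename(tmp, final)
    return final


def run_task(out_dir, task_id, stable_name):
    """Attempt 0 commits and crashes between (5) and (6); attempt 1 retries."""
    am = AppMaster()
    assert am.can_commit(task_id, 0)
    commit(out_dir, task_id, 0, "rows", stable_name)
    am.attempt_failed(task_id, 0)  # the AM never saw a success notification
    assert am.can_commit(task_id, 1)
    commit(out_dir, task_id, 1, "rows", stable_name)
    return sorted(os.listdir(out_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        print(run_task(d, 0, stable_name=True))   # ['000000']
    with tempfile.TemporaryDirectory() as d:
        print(run_task(d, 0, stable_name=False))  # ['000000_0', '000000_1']
```

With a stable final name the retry overwrites the first attempt's output and one file remains. When the attempt id is part of the final name, the same sequence leaves two files, even though canCommit granted only one attempt at a time.
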
> To summarize, tez-examples uses two methods to avoid duplicate files:
> * (1) Avoid repeated commits through canCommit. This is particularly
> effective for tasks with speculative execution turned on.
> * (2) The final file names generated by different task attempts are the
> same. Combined with canCommit, this guarantees that only one file is
> generated in the end, and that it can only come from a successful task
> attempt.
> *3 Why can't Hive on Tez avoid duplicate files?*
> Hive on Tez has neither of the two mechanisms used by the Tez examples.
> First of all, Hive on Tez does not call canCommit. TezProcessor inherits
> from AbstractLogicalIOProcessor, whereas the canCommit logic in the Tez
> examples is mainly in SimpleMRProcessor.
> Secondly, the file names generated by different attempts under Hive on Tez
> are not the same. The file generated by the first attempt of a task is
> 000000_0, and the file generated by the second attempt is 000000_1.
> *4 How to improve?*
> Use canCommit to ensure that speculative task attempts are not committed at
> the same time. (HIVE-27899)
> Let different task attempts of each task generate the same final file name.
> (HIVE-27986)

--
This message was sent by Atlassian Jira
(v8.20.10#820010)