[ 
https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chenyu Zheng updated HIVE-27985:
--------------------------------
    Attachment: how tez examples commit.png

> Avoid duplicate files.
> ----------------------
>
>                 Key: HIVE-27985
>                 URL: https://issues.apache.org/jira/browse/HIVE-27985
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>         Attachments: how tez examples commit.png
>
>
> 1 Background
> Hive on Tez occasionally produces duplicate files, especially when speculative 
> execution is enabled. Hive identifies and removes duplicate files through 
> removeTempOrDuplicateFiles. However, this logic often does not take effect. 
> For example, a killed task attempt may commit files while this method is 
> running, or the files under HIVE_UNION_SUBDIR_X may not be recognized during 
> union all. Many issues have been filed to solve these problems, mainly focusing 
> on how to identify duplicate files. **This issue mainly solves this problem by 
> avoiding the generation of duplicate files.**
> 2 How does Tez avoid duplicate files?
> After testing, I found that the Hadoop MapReduce examples and the Tez examples 
> do not have this problem. With a properly designed OutputCommitter, duplicate 
> files can be avoided. Let's analyze how Tez avoids duplicate files.
> > Compared with Tez, Hadoop MapReduce has one more step, commitPending, which 
> > is not critical, so only Tez is analyzed here.
>  
>  
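The idea of avoiding duplicates at commit time, rather than cleaning them up afterwards, can be illustrated with a small simulation. This is only a sketch under stated assumptions: it is not the Tez or Hadoop API, and every name in it (CommitArbiter, can_commit, run_attempt, the _temporary directory layout) is hypothetical. It assumes each task attempt writes to a private directory and that a coordinator approves at most one attempt per task before output is renamed into place.

```python
import os
import tempfile


class CommitArbiter:
    """Hypothetical stand-in for the coordinator that approves commits."""

    def __init__(self):
        self._winner = {}  # task id -> attempt id allowed to commit

    def can_commit(self, task_id, attempt_id):
        # The first attempt to ask wins; other attempts of the task are refused.
        return self._winner.setdefault(task_id, attempt_id) == attempt_id


def run_attempt(arbiter, out_dir, task_id, attempt_id, data):
    """Write to a private attempt directory, then commit only if approved."""
    attempt_dir = os.path.join(out_dir, "_temporary", f"{task_id}_{attempt_id}")
    os.makedirs(attempt_dir)
    tmp_file = os.path.join(attempt_dir, f"part-{task_id}")
    with open(tmp_file, "w") as f:
        f.write(data)

    if not arbiter.can_commit(task_id, attempt_id):
        # Losing attempt (for example a speculative one): discard its output.
        os.remove(tmp_file)
        os.rmdir(attempt_dir)
        return False

    # Winning attempt: publish the file under its final name.
    os.rename(tmp_file, os.path.join(out_dir, f"part-{task_id}"))
    os.rmdir(attempt_dir)
    return True


def demo():
    out_dir = tempfile.mkdtemp()
    arbiter = CommitArbiter()
    results = [
        run_attempt(arbiter, out_dir, "000000", 0, "rows\n"),
        run_attempt(arbiter, out_dir, "000000", 1, "rows\n"),  # speculative
    ]
    # List only published files, skipping the _temporary working directory.
    files = sorted(n for n in os.listdir(out_dir) if not n.startswith("_"))
    return results, files
```

Calling demo() runs two attempts of the same task. Because only the approved attempt ever renames its file into the output directory, a second attempt cannot leave a duplicate behind, whatever order the attempts finish in.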



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
