[ https://issues.apache.org/jira/browse/HIVE-27985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chenyu Zheng updated HIVE-27985:
--------------------------------
    Attachment: how tez examples commit.png

> Avoid duplicate files.
> ----------------------
>
>                 Key: HIVE-27985
>                 URL: https://issues.apache.org/jira/browse/HIVE-27985
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>            Reporter: Chenyu Zheng
>            Assignee: Chenyu Zheng
>            Priority: Major
>         Attachments: how tez examples commit.png
>
>
> 1 Background
> Hive on Tez occasionally produces duplicate files, especially when speculative
> execution is enabled. Hive identifies and removes duplicate files through
> removeTempOrDuplicateFiles. However, this logic often does not take effect.
> For example, a killed task attempt may commit files while this method is
> executing, or the files under HIVE_UNION_SUBDIR_X may not be recognized during
> union all. Several issues have been filed to address these problems, mainly
> focusing on how to identify duplicate files. **This issue instead addresses
> the problem by avoiding the generation of duplicate files.**
> 2 How Tez avoids duplicate files?
> After testing, I found that the Hadoop MapReduce examples and the Tez examples
> do not have this problem. With a properly designed OutputCommitter, duplicate
> files can be avoided. Let's analyze how Tez avoids duplicate files.
> > Compared with Tez, Hadoop MapReduce has one additional step, commitPending,
> > which is not critical, so only Tez is analyzed here.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)