George Pachitariu created HIVE-23891:
----------------------------------------

             Summary: Using UNION sql clause and speculative execution can cause file duplication in Tez
                 Key: HIVE-23891
                 URL: https://issues.apache.org/jira/browse/HIVE-23891
             Project: Hive
          Issue Type: Bug
            Reporter: George Pachitariu
            Assignee: George Pachitariu


Hello,

This can happen in the following scenario:
 - the execution engine is Tez;
 - speculative execution is on;
 - the query inserts into a table and its last step is a UNION clause.

The problem is that Tez creates an extra layer of subdirectories when there is a UNION. Later, when deduplicating, Hive doesn't take that extra level into account: it deduplicates the folders but not the files inside them.

So for a query like this:
{code:sql}
insert overwrite table union_all
    select * from union_first_part
union all
    select * from union_second_part;
{code}
The folder structure afterwards can look like this (one possible example):
{noformat}
.../union_all/HIVE_UNION_SUBDIR_1/000000_0
.../union_all/HIVE_UNION_SUBDIR_1/000000_1
.../union_all/HIVE_UNION_SUBDIR_2/000000_1
{noformat}
Here HIVE_UNION_SUBDIR_1 holds both 000000_0 and 000000_1, which would be the output of two attempts of the same task, and both files are kept.

The attached patch increases the number of folder levels that Hive checks recursively for duplicates when there is a UNION in Tez.

Feel free to reach out if you have any questions :).
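
To illustrate the idea behind the fix, here is a minimal sketch in Python. It is not Hive's actual implementation and not the attached patch. It assumes that output files are named `<taskId>_<attemptId>` as in the listing above, and the names `split_task_attempt` and `find_duplicates` are hypothetical. The rule for choosing which file survives (lowest attempt number) is also an assumption made for the example.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def split_task_attempt(file_name):
    """Split a name such as '000000_1' into (task_id, attempt).

    Assumes the '<taskId>_<attemptId>' naming seen in the listing.
    """
    task_id, _, attempt = file_name.rpartition("_")
    return task_id, int(attempt)


def find_duplicates(paths):
    """Return the paths to delete, keeping one file per (directory, task ID).

    Grouping by the full parent directory means that files nested inside a
    HIVE_UNION_SUBDIR_* folder are compared with each other, instead of the
    folder being treated as a single opaque entry.
    """
    groups = defaultdict(list)
    for path in paths:
        p = PurePosixPath(path)
        task_id, attempt = split_task_attempt(p.name)
        groups[(str(p.parent), task_id)].append((attempt, path))

    to_delete = []
    for candidates in groups.values():
        candidates.sort()
        # Assumption: keep the lowest attempt number. A real implementation
        # may choose the surviving file differently.
        to_delete.extend(path for _, path in candidates[1:])
    return sorted(to_delete)


if __name__ == "__main__":
    listing = [
        "union_all/HIVE_UNION_SUBDIR_1/000000_0",
        "union_all/HIVE_UNION_SUBDIR_1/000000_1",
        "union_all/HIVE_UNION_SUBDIR_2/000000_1",
    ]
    for duplicate in find_duplicates(listing):
        print(duplicate)
```

For the three files from the example listing, this sketch flags only `HIVE_UNION_SUBDIR_1/000000_1` as a duplicate. The file in `HIVE_UNION_SUBDIR_2` is kept because it is the only output for its task in that subdirectory.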