[ https://issues.apache.org/jira/browse/HIVE-23891?focusedWorklogId=831287&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-831287 ]
ASF GitHub Bot logged work on HIVE-23891:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 06/Dec/22 07:31
Start Date: 06/Dec/22 07:31
Worklog Time Spent: 10m
Work Description: sonarcloud[bot] commented on PR #3836:
URL: https://github.com/apache/hive/pull/3836#issuecomment-1338903701

Kudos, SonarCloud Quality Gate passed!

[0 Bugs](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3836&resolved=false&types=BUG)
[0 Vulnerabilities](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3836&resolved=false&types=VULNERABILITY)
[0 Security Hotspots](https://sonarcloud.io/project/security_hotspots?id=apache_hive&pullRequest=3836&resolved=false&types=SECURITY_HOTSPOT)
[8 Code Smells](https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=3836&resolved=false&types=CODE_SMELL)

No Coverage information
No Duplication information

Issue Time Tracking
-------------------

Worklog Id: (was: 831287)
Time Spent: 2h 50m (was: 2h 40m)

> Using UNION sql clause and speculative execution can cause file duplication in Tez
> ----------------------------------------------------------------------------------
>
> Key: HIVE-23891
> URL: https://issues.apache.org/jira/browse/HIVE-23891
> Project: Hive
> Issue Type: Bug
> Reporter: George Pachitariu
> Assignee: George Pachitariu
> Priority: Major
> Labels: pull-request-available
> Attachments: HIVE-23891.1.patch
>
> Time Spent: 2h 50m
> Remaining Estimate: 0h
>
> Hello,
> the specific scenario when this can happen:
> - the execution engine is Tez;
> - speculative execution is on;
> - the query inserts into a table and the last step is a UNION sql clause;
> The problem is that Tez creates an extra layer of subdirectories when there
> is a UNION. Later, when deduplicating, Hive doesn't take that into account
> and only deduplicates folders but not the files inside.
> So for a query like this:
> {code:sql}
> insert overwrite table union_all
> select * from union_first_part
> union all
> select * from union_second_part;
> {code}
> The folder structure afterwards will be like this (a possible example):
> {code:java}
> .../union_all/HIVE_UNION_SUBDIR_1/000000_0
> .../union_all/HIVE_UNION_SUBDIR_1/000000_1
> .../union_all/HIVE_UNION_SUBDIR_2/000000_1
> {code}
> The attached patch increases the number of folder levels that Hive will check
> recursively for duplicates when we have a UNION in Tez.
> Feel free to reach out if you have any questions :).

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
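
The depth problem described in the report can be illustrated with a small sketch. This is not Hive's actual code: the function `dedup`, the `max_depth` parameter, the assumed `<taskId>_<attemptId>` file naming, and the rule of keeping the highest attempt are all hypothetical choices made for illustration. It only shows why a check that stops one level too early leaves both attempts of the same task in place.

```python
import re
from collections import defaultdict
from pathlib import PurePosixPath

# Assumed naming scheme for this sketch: "<taskId>_<attemptId>", e.g. "000000_1".
TASK_FILE = re.compile(r"^(\d+)_(\d+)$")


def dedup(paths, max_depth):
    """Return the paths that survive deduplication.

    Files are grouped by (parent directory, task id). Files nested deeper
    than max_depth directories are not inspected at all, which mimics a
    check that does not descend far enough.
    """
    groups = defaultdict(list)
    kept = []
    for p in map(PurePosixPath, paths):
        m = TASK_FILE.match(p.name)
        depth = len(p.parts) - 1  # number of directories above the file
        if m is None or depth > max_depth:
            kept.append(str(p))  # not inspected: kept as-is
            continue
        groups[(p.parent, m.group(1))].append((int(m.group(2)), str(p)))
    for attempts in groups.values():
        # Hypothetical rule: keep the highest attempt id of each task.
        kept.append(max(attempts)[1])
    return sorted(kept)


files = [
    "union_all/HIVE_UNION_SUBDIR_1/000000_0",
    "union_all/HIVE_UNION_SUBDIR_1/000000_1",
    "union_all/HIVE_UNION_SUBDIR_2/000000_1",
]

shallow = dedup(files, max_depth=1)  # only looks directly under union_all/
deep = dedup(files, max_depth=2)     # also looks inside the UNION subdirectories
```

With `max_depth=1` all three files survive, including both attempts of task `000000` in `HIVE_UNION_SUBDIR_1`. With `max_depth=2` only one attempt per task remains in each subdirectory, which is the effect the report attributes to checking one more folder level.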