[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100564#comment-16100564 ]
Jason Dere commented on HIVE-17113:
-----------------------------------

It looks like in the case of skewjoin in Spark, multiple jobs can copy files into the same temp directory. When this happens, there can be name collisions: in the test there are collisions on files 000000_0 and 000001_0, which get renamed to 000000_0_1 and 000001_0_1. Since removeTempOrDuplicateFiles() is now being called on the destination directory, it is not able to correctly disambiguate the 000000_0_1 and 000001_0_1 files.

Since the destination directory can potentially hold results from more than one job, it does not seem correct to simply run removeTempOrDuplicateFiles() on the destination directory. Maybe we have to change the logic to the following:
1) Move the temp directory to a new directory name, to prevent additional files from being added by any runaway processes.
2) Run removeTempOrDuplicateFiles() on this renamed temp directory.
3) Run renameOrMoveFiles() to move the renamed temp directory to the final location.

Step 1 might be expensive for cloud storage, though (it basically means performing twice the file moves, right?). [~ashutoshc] should doing step 1 be a configurable setting?

> Duplicate bucket files can get written to table by runaway task
> ---------------------------------------------------------------
>
>                 Key: HIVE-17113
>                 URL: https://issues.apache.org/jira/browse/HIVE-17113
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Jason Dere
>            Assignee: Jason Dere
>         Attachments: HIVE-17113.1.patch
>
>
> Saw a table get a duplicate bucket file from a Hive query. It looks like the
> following happened:
> 1. Task attempt A_0 starts, but then stops making progress
> 2. The job was running with speculative execution on, and task attempt A_1 is
> started
> 3. Task attempt A_1 finishes execution and saves its output to the temp
> directory.
> 5. A task kill is sent to A_0, though this does not appear to actually kill A_0
> 6. The job for the query finishes and Utilities.mvFileToFinalPath() calls
> Utilities.removeTempOrDuplicateFiles() to check for duplicate bucket files
> 7. A_0 (still running) finally finishes and saves its file to the temp
> directory. At this point we now have duplicate bucket files - oops!
> 8. Utilities.removeTempOrDuplicateFiles() moves the temp directory to the
> final location, where it is later moved to the partition directory.
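
For illustration, the three-step sequence proposed in the comment (rename the temp directory, de-duplicate inside the renamed directory, then move to the final location) could be sketched roughly as below. This is not Hive code: it uses java.nio.file on a local filesystem, and the class name, the ".sealed" suffix, the "<taskId>_<attemptId>" file naming, and the "keep the largest file per task" rule are all assumptions made for the sketch. The "_N" suffix on a name collision mirrors the 000000_0_1 renaming described in the comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch of the proposed commit sequence; not the Hive implementation.
final class RunawayTaskCommitSketch {

    // Assumed naming scheme: "<taskId>_<attemptId>", e.g. "000000_0" and "000000_1".
    static String taskId(String fileName) {
        int idx = fileName.indexOf('_');
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    static List<Path> listFiles(Path dir) throws IOException {
        try (Stream<Path> s = Files.list(dir)) {
            return s.sorted().collect(Collectors.toList());
        }
    }

    // Stand-in for step 2: keep one file per task id (here: the largest), delete the rest.
    static void removeDuplicates(Path dir) throws IOException {
        Map<String, Path> kept = new HashMap<>();
        for (Path f : listFiles(dir)) {
            String id = taskId(f.getFileName().toString());
            Path prev = kept.get(id);
            if (prev == null) {
                kept.put(id, f);
            } else if (Files.size(f) > Files.size(prev)) {
                Files.delete(prev);
                kept.put(id, f);
            } else {
                Files.delete(f);
            }
        }
    }

    // Stand-in for step 3: move files into finalDir, adding "_N" on a name collision.
    static void moveFiles(Path from, Path finalDir) throws IOException {
        Files.createDirectories(finalDir);
        for (Path f : listFiles(from)) {
            String name = f.getFileName().toString();
            Path target = finalDir.resolve(name);
            for (int n = 1; Files.exists(target); n++) {
                target = finalDir.resolve(name + "_" + n);
            }
            Files.move(f, target);
        }
    }

    static void commit(Path tmpDir, Path finalDir) throws IOException {
        // Step 1: rename the temp directory so late writers to tmpDir cannot add files to it.
        Path sealed = tmpDir.resolveSibling(tmpDir.getFileName() + ".sealed");
        Files.move(tmpDir, sealed);
        // Step 2: de-duplicate inside the renamed directory only.
        removeDuplicates(sealed);
        // Step 3: move the surviving files to the final location.
        moveFiles(sealed, finalDir);
        Files.delete(sealed); // the renamed directory is empty at this point
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("commit-sketch");
        Path tmp = Files.createDirectory(root.resolve("_tmp.out"));
        Files.write(tmp.resolve("000000_0"), "a".getBytes());
        Files.write(tmp.resolve("000000_1"), "aaaa".getBytes());
        commit(tmp, root.resolve("final"));
        System.out.println(listFiles(root.resolve("final")));
    }
}
```

Because de-duplication runs on the renamed directory rather than on the destination, files already placed in the destination by another job are never examined. The cost noted in the comment still applies: on object stores a directory rename is typically a copy, so step 1 roughly doubles the file moves.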