[ https://issues.apache.org/jira/browse/HIVE-17113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16100951#comment-16100951 ]
Jason Dere commented on HIVE-17113:
-----------------------------------

Spoke offline to [~ashutoshc], who recommended the following approach:
- During Utilities.removeTempOrDuplicateFiles(), maintain a list of the files that were found and kept after deduplication. This list will then be used to determine which files are moved to the destination directory.
- Add a configurable setting to control whether this file list is used to decide which files get moved, or whether the existing behavior (moving everything in the temp directory) is kept. (A rough sketch of this idea appears after the issue details below.)

> Duplicate bucket files can get written to table by runaway task
> ---------------------------------------------------------------
>
>                 Key: HIVE-17113
>                 URL: https://issues.apache.org/jira/browse/HIVE-17113
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Processor
>            Reporter: Jason Dere
>            Assignee: Jason Dere
>         Attachments: HIVE-17113.1.patch
>
>
> Saw a table get a duplicate bucket file from a Hive query. It looks like the following happened:
> 1. Task attempt A_0 starts, but then stops making progress.
> 2. The job is running with speculative execution on, so task attempt A_1 is started.
> 3. Task attempt A_1 finishes execution and saves its output to the temp directory.
> 4. A task kill is sent to A_0, though this does not appear to actually kill A_0.
> 5. The job for the query finishes and Utilities.mvFileToFinalPath() calls Utilities.removeTempOrDuplicateFiles() to check for duplicate bucket files.
> 6. A_0 (still running) finally finishes and saves its file to the temp directory. At this point we now have duplicate bucket files - oops!
> 7. Utilities.removeTempOrDuplicateFiles() moves the temp directory to the final location, where it is later moved to the partition directory.
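
For illustration only, here is a minimal, hypothetical sketch of the approach described in the comment above. It is not the actual Hive patch: the class name DedupedMove, the methods collectValidFiles() and moveToFinal(), the bucket-name parsing, and the config key name are all assumptions made for this example; only the Hadoop FileSystem/Path/Configuration APIs are real.

{code:java}
// Hypothetical, simplified sketch of the proposed approach. Names are
// illustrative and do not match the real Hive Utilities code.
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class DedupedMove {

  // Hypothetical config key gating the new behavior (assumption, not a real Hive setting).
  private static final String MOVE_ONLY_LISTED_FILES =
      "hive.exec.move.files.from.source.dir.list";

  // Stand-in for the dedup pass in Utilities.removeTempOrDuplicateFiles():
  // scan the temp directory once, keep one file per bucket, and return the
  // list of files that survived deduplication.
  public static List<Path> collectValidFiles(FileSystem fs, Path tmpDir) throws IOException {
    List<Path> valid = new ArrayList<>();
    Set<String> seenBuckets = new HashSet<>();
    for (FileStatus stat : fs.listStatus(tmpDir)) {
      // e.g. "000001" from "000001_0"; simplistic parsing for illustration
      String bucketId = bucketIdOf(stat.getPath().getName());
      if (seenBuckets.add(bucketId)) {
        valid.add(stat.getPath());
      } else {
        fs.delete(stat.getPath(), false); // duplicate bucket file, remove it
      }
    }
    return valid;
  }

  // Stand-in for the move step in Utilities.mvFileToFinalPath(): when the flag
  // is on, move only the files recorded during dedup, so a straggler file
  // written after the dedup check is never promoted to the final directory.
  public static void moveToFinal(Configuration conf, FileSystem fs, Path tmpDir,
      Path finalDir, List<Path> validFiles) throws IOException {
    if (conf.getBoolean(MOVE_ONLY_LISTED_FILES, false)) {
      fs.mkdirs(finalDir);
      for (Path src : validFiles) {
        fs.rename(src, new Path(finalDir, src.getName()));
      }
    } else {
      // Existing behavior: rename the whole temp directory, picking up
      // whatever files happen to be present at that moment.
      fs.rename(tmpDir, finalDir);
    }
  }

  private static String bucketIdOf(String fileName) {
    int idx = fileName.indexOf('_');
    return idx > 0 ? fileName.substring(0, idx) : fileName;
  }
}
{code}

With this shape, a straggler attempt that writes its output after collectValidFiles() has run would leave a file behind in the temp directory, but with the flag enabled that file would never be renamed into the final directory, since only the files recorded during deduplication are moved.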