[ 
https://issues.apache.org/jira/browse/HIVE-14535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15593084#comment-15593084
 ] 

Gopal V commented on HIVE-14535:
--------------------------------

>  Do you think it would be reasonable to commit the changes to the 
> FileSinkOperator without the rest of the MM tables support?

No. A direct output committer approach without query isolation has lost data 
for production customers before, by accidentally forcing multiple tasks to 
write to the same file name. Because of the way checksum safety works, the 
first writer is not the winner in failure-tolerance scenarios.

We want to prevent users from making such expensive mistakes again, so this 
patch isolates different queries from each other. Without that isolation, 
queries will stomp over each other's files.
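
To make the failure mode concrete, here is a minimal sketch in Python. It is 
not Hive code: the function names, the `query_<id>` subdirectory layout and the 
file-name pattern are assumptions made up for illustration. It only shows that 
two queries writing the same task file name directly into one directory 
overwrite each other, while a per-query location keeps both outputs.

```python
import os
import tempfile


def direct_write(table_dir, task_id, payload):
    """Write task output straight into the table directory (no isolation)."""
    path = os.path.join(table_dir, f"{task_id:06d}_0")
    with open(path, "w") as f:
        f.write(payload)
    return path


def isolated_write(table_dir, query_id, task_id, payload):
    """Write task output under a per-query subdirectory.

    The "query_<id>" layout is hypothetical, chosen only for this sketch.
    """
    query_dir = os.path.join(table_dir, f"query_{query_id}")
    os.makedirs(query_dir, exist_ok=True)
    path = os.path.join(query_dir, f"{task_id:06d}_0")
    with open(path, "w") as f:
        f.write(payload)
    return path


def read(path):
    with open(path) as f:
        return f.read()


def demo():
    # No isolation: both queries produce the same file name for task 0,
    # so the second write silently replaces the first.
    with tempfile.TemporaryDirectory() as table_dir:
        a = direct_write(table_dir, 0, "rows from query A")
        b = direct_write(table_dir, 0, "rows from query B")
        clobbered = a == b
        surviving = read(a)

    # With isolation: each query writes under its own directory,
    # so both outputs survive.
    with tempfile.TemporaryDirectory() as table_dir:
        a = isolated_write(table_dir, "A", 0, "rows from query A")
        b = isolated_write(table_dir, "B", 0, "rows from query B")
        kept = [read(a), read(b)]

    return clobbered, surviving, kept
```

Calling `demo()` returns whether the two unisolated writes hit the same path, 
which payload survived, and the payloads kept in the isolated case.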

>  I know there are some concerns that this "direct output committer" approach 
> could cause data corruption issues, is this something that was considered 
> explicitly in the design? If so, could you expand on why those data 
> corruption issues would occur?

Without the isolation fix, the other parts are dangerous to use. 

With the isolation in place, the system moves away from the move model to a 
cleanup model (the cleanup code already exists; today it is applied only to 
the scratch dir).
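
The difference between the two models can be sketched roughly as follows. 
This is an illustration in Python under my own assumptions, not the patch 
itself: `commit_by_move`, `commit_by_cleanup` and the `committed` set are 
hypothetical names, and how the real system decides which files are committed 
is not shown here.

```python
import os
import shutil
import tempfile


def touch(path):
    """Create an empty file standing in for a task's output."""
    with open(path, "w"):
        pass


def commit_by_move(scratch_dir, table_dir):
    """Move model: tasks write to a scratch dir; commit moves files in."""
    for name in sorted(os.listdir(scratch_dir)):
        shutil.move(os.path.join(scratch_dir, name),
                    os.path.join(table_dir, name))


def commit_by_cleanup(table_dir, committed):
    """Cleanup model: tasks write in place; commit deletes the leftovers.

    `committed` is a hypothetical set of file names known to be valid.
    """
    for name in sorted(os.listdir(table_dir)):
        if name not in committed:
            os.remove(os.path.join(table_dir, name))


def demo():
    with tempfile.TemporaryDirectory() as root:
        # Move model: finished files start in scratch and end in the table.
        scratch = os.path.join(root, "scratch")
        table_move = os.path.join(root, "table_move")
        os.makedirs(scratch)
        os.makedirs(table_move)
        for name in ("000000_0", "000001_0"):
            touch(os.path.join(scratch, name))
        commit_by_move(scratch, table_move)
        moved = sorted(os.listdir(table_move))

        # Cleanup model: files are written in place, including a leftover
        # from a failed task attempt ("000001_1"), which is then removed.
        table_cleanup = os.path.join(root, "table_cleanup")
        os.makedirs(table_cleanup)
        for name in ("000000_0", "000001_0", "000001_1"):
            touch(os.path.join(table_cleanup, name))
        commit_by_cleanup(table_cleanup, {"000000_0", "000001_0"})
        cleaned = sorted(os.listdir(table_cleanup))

    return moved, cleaned
```

Both paths end with the same set of visible files; the cleanup variant gets 
there without a move step, which is only safe if concurrent queries cannot 
write to the same file names.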

> add micromanaged tables to Hive (metastore keeps track of the files)
> --------------------------------------------------------------------
>
>                 Key: HIVE-14535
>                 URL: https://issues.apache.org/jira/browse/HIVE-14535
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>
> Design doc: 
> https://docs.google.com/document/d/1b3t1RywfyRb73-cdvkEzJUyOiekWwkMHdiQ-42zCllY
> Feel free to comment.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
