[jira] [Assigned] (HIVE-23703) Major QB compaction with multiple FileSinkOperators results in data loss and one original file

Karen Coppage (Jira) Tue, 16 Jun 2020 07:59:14 -0700


     [ 
https://issues.apache.org/jira/browse/HIVE-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Karen Coppage reassigned HIVE-23703:
------------------------------------


> Major QB compaction with multiple FileSinkOperators results in data loss and 
> one original file
> ----------------------------------------------------------------------------------------------
>
>                 Key: HIVE-23703
>                 URL: https://issues.apache.org/jira/browse/HIVE-23703
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Critical
>              Labels: compaction
>
> h4. Problems
> Example:
> {code:java}
> drop table if exists tbl2;
> create transactional table tbl2 (a int, b int) clustered by (a) into 4 
> buckets stored as ORC 
> TBLPROPERTIES('transactional'='true','transactional_properties'='default');
> insert into tbl2 values(1,2),(1,3),(1,4),(2,2),(2,3),(2,4);
> insert into tbl2 values(3,2),(3,3),(3,4),(4,2),(4,3),(4,4);
> insert into tbl2 values(5,2),(5,3),(5,4),(6,2),(6,3),(6,4);{code}
> E.g. in the example above, bucketId=0 when a=2 and a=6.
> 1. Data loss 
>  In non-acid tables, an operator's temp files are named with their task id. 
> Because of this snippet, temp files in the FileSinkOperator for compaction 
> tables are identified by their bucket_id.
> {code:java}
> if (conf.isCompactionTable()) {
>  fsp.initializeBucketPaths(filesIdx, AcidUtils.BUCKET_PREFIX + 
> String.format(AcidUtils.BUCKET_DIGITS, bucketId),
>  isNativeTable(), isSkewedStoredAsSubDirectories);
>  } else {
>  fsp.initializeBucketPaths(filesIdx, taskId, isNativeTable(), 
> isSkewedStoredAsSubDirectories);
>  }
> {code}
> So 2 temp files containing data with a=2 and a=6 will be named bucket_0 and 
> not 000000_0 and 000000_1 as they would normally.
>  In FileSinkOperator.commit, when data with a=2, filename: bucket_0 is moved 
> from _task_tmp.-ext-10002 to _tmp.-ext-10002, it overwrites the files already 
> there with a=6 data, because it too is named bucket_0. You can see in the 
> logs:
> {code:java}
>  WARN [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: Target 
> path 
> file:.../hive/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnNoBuckets-1591107230237/warehouse/testmajorcompaction/base_0000002_v0000013/.hive-staging_hive_2020-06-02_07-15-21_771_8551447285061957908-1/_tmp.-ext-10002/bucket_00000
>  with a size 610 exists. Trying to delete it.
> {code}
> 2. Results in one original file
>  OrcFileMergeOperator merges the results of the FSOp into 1 file named 
> 000000_0.
> h4. Fix
> 1. FSOp will store data as: taskid/bucketId. e.g. 0_0/bucket_0
> 2. OrcMergeFileOp, instead of merging a bunch of files into 1 file named 
> 000000_0, will merge all files named bucket_0 into one file named bucket_0, 
> and so on.
> 3. MoveTask will get rid of the taskId directories if present and only move 
> the bucket files in them, in case OrcMergeFileOp is not run.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Assigned] (HIVE-23703) Major QB compaction with multiple FileSinkOperators results in data loss and one original file

Reply via email to