[ https://issues.apache.org/jira/browse/HIVE-23703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karen Coppage reassigned HIVE-23703: ------------------------------------ > Major QB compaction with multiple FileSinkOperators results in data loss and > one original file > ---------------------------------------------------------------------------------------------- > > Key: HIVE-23703 > URL: https://issues.apache.org/jira/browse/HIVE-23703 > Project: Hive > Issue Type: Bug > Reporter: Karen Coppage > Assignee: Karen Coppage > Priority: Critical > Labels: compaction > > h4. Problems > Example: > {code:java} > drop table if exists tbl2; > create transactional table tbl2 (a int, b int) clustered by (a) into 4 > buckets stored as ORC > TBLPROPERTIES('transactional'='true','transactional_properties'='default'); > insert into tbl2 values(1,2),(1,3),(1,4),(2,2),(2,3),(2,4); > insert into tbl2 values(3,2),(3,3),(3,4),(4,2),(4,3),(4,4); > insert into tbl2 values(5,2),(5,3),(5,4),(6,2),(6,3),(6,4);{code} > E.g. in the example above, bucketId=0 when a=2 and a=6. > 1. Data lossĀ > In non-acid tables, an operator's temp files are named with their task id. > Because of this snippet, temp files in the FileSinkOperator for compaction > tables are identified by their bucket_id. > {code:java} > if (conf.isCompactionTable()) { > fsp.initializeBucketPaths(filesIdx, AcidUtils.BUCKET_PREFIX + > String.format(AcidUtils.BUCKET_DIGITS, bucketId), > isNativeTable(), isSkewedStoredAsSubDirectories); > } else { > fsp.initializeBucketPaths(filesIdx, taskId, isNativeTable(), > isSkewedStoredAsSubDirectories); > } > {code} > So 2 temp files containing data with a=2 and a=6 will be named bucket_0 and > not 000000_0 and 000000_1 as they would normally. > In FileSinkOperator.commit, when data with a=2, filename: bucket_0 is moved > from _task_tmp.-ext-10002 to _tmp.-ext-10002, it overwrites the files already > there with a=6 data, because it too is named bucket_0. You can see in the > logs: > {code:java} > WARN [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: Target > path > file:.../hive/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnNoBuckets-1591107230237/warehouse/testmajorcompaction/base_0000002_v0000013/.hive-staging_hive_2020-06-02_07-15-21_771_8551447285061957908-1/_tmp.-ext-10002/bucket_00000 > with a size 610 exists. Trying to delete it. > {code} > 2. Results in one original file > OrcFileMergeOperator merges the results of the FSOp into 1 file named > 000000_0. > h4. Fix > 1. FSOp will store data as: taskid/bucketId. e.g. 0_0/bucket_0 > 2. OrcMergeFileOp, instead of merging a bunch of files into 1 file named > 000000_0, will merge all files named bucket_0 into one file named bucket_0, > and so on. > 3. MoveTask will get rid of the taskId directories if present and only move > the bucket files in them, in case OrcMergeFileOp is not run. -- This message was sent by Atlassian Jira (v8.3.4#803005)