Wei Zheng created HIVE-12724:
--------------------------------

             Summary: ACID: Major compaction fails to include the original bucket files into MR job
                 Key: HIVE-12724
                 URL: https://issues.apache.org/jira/browse/HIVE-12724
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.0.0, 2.1.0
            Reporter: Wei Zheng
            Assignee: Wei Zheng
How the problem happens (a condensed reproduction script is sketched after the transcript below):
* Create a non-ACID table
* Before the non-ACID to ACID table conversion, insert row one
* After the non-ACID to ACID table conversion, insert row two
* Both rows can be retrieved before MAJOR compaction
* After MAJOR compaction, row one is lost

{code}
hive> USE acidtest;
OK
Time taken: 0.77 seconds
hive> CREATE TABLE t1 (nationkey INT, name STRING, regionkey INT, comment STRING)
    > CLUSTERED BY (regionkey) INTO 2 BUCKETS
    > STORED AS ORC;
OK
Time taken: 0.179 seconds
hive> DESC FORMATTED t1;
OK
# col_name              data_type               comment

nationkey               int
name                    string
regionkey               int
comment                 string

# Detailed Table Information
Database:               acidtest
Owner:                  wzheng
CreateTime:             Mon Dec 14 15:50:40 PST 2015
LastAccessTime:         UNKNOWN
Retention:              0
Location:               file:/Users/wzheng/hivetmp/warehouse/acidtest.db/t1
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1450137040

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            2
Bucket Columns:         [regionkey]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.198 seconds, Fetched: 28 row(s)
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db;
Found 1 items
drwxr-xr-x   - wzheng staff         68 2015-12-14 15:50 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1;
hive> INSERT INTO TABLE t1 VALUES (1, 'USA', 1, 'united states');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = wzheng_20151214155028_630098c6-605f-4e7e-a797-6b49fb48360d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2015-12-14 15:51:58,070 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local73977356_0001
Loading data to table acidtest.t1
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.825 seconds
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1;
Found 2 items
-rwxr-xr-x   1 wzheng staff        112 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000000_0
-rwxr-xr-x   1 wzheng staff        472 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000001_0
hive> SELECT * FROM t1;
OK
1       USA     1       united states
Time taken: 0.434 seconds, Fetched: 1 row(s)
hive> ALTER TABLE t1 SET TBLPROPERTIES ('transactional' = 'true');
OK
Time taken: 0.071 seconds
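-- Note: the table is now marked transactional, but row one still exists only in the
-- original (pre-ACID) bucket files 000000_0 and 000001_0 under the table root.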
hive> DESC FORMATTED t1;
OK
# col_name              data_type               comment

nationkey               int
name                    string
regionkey               int
comment                 string

# Detailed Table Information
Database:               acidtest
Owner:                  wzheng
CreateTime:             Mon Dec 14 15:50:40 PST 2015
LastAccessTime:         UNKNOWN
Retention:              0
Location:               file:/Users/wzheng/hivetmp/warehouse/acidtest.db/t1
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        last_modified_by        wzheng
        last_modified_time      1450137141
        numFiles                2
        numRows                 -1
        rawDataSize             -1
        totalSize               584
        transactional           true
        transient_lastDdlTime   1450137141

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            2
Bucket Columns:         [regionkey]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.049 seconds, Fetched: 36 row(s)
hive> set hive.support.concurrency=true;
hive> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
hive> set hive.compactor.initiator.on=true;
hive> set hive.compactor.worker.threads=5;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1;
Found 2 items
-rwxr-xr-x   1 wzheng staff        112 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000000_0
-rwxr-xr-x   1 wzheng staff        472 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000001_0
hive> INSERT INTO TABLE t1 VALUES (2, 'Canada', 1, 'maple leaf');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = wzheng_20151214155028_630098c6-605f-4e7e-a797-6b49fb48360d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2015-12-14 15:54:18,943 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1674014367_0002
Loading data to table acidtest.t1
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.995 seconds
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1;
Found 3 items
-rwxr-xr-x   1 wzheng staff        112 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000000_0
-rwxr-xr-x   1 wzheng staff        472 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000001_0
drwxr-xr-x   - wzheng staff        204 2015-12-14 15:54 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/delta_0000007_0000007_0000
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/delta_0000007_0000007_0000;
Found 2 items
-rw-r--r--   1 wzheng staff        214 2015-12-14 15:54 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/delta_0000007_0000007_0000/bucket_00000
-rw-r--r--   1 wzheng staff        797 2015-12-14 15:54 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/delta_0000007_0000007_0000/bucket_00001
hive> SELECT * FROM t1;
OK
1       USA     1       united states
2       Canada  1       maple leaf
Time taken: 0.1 seconds, Fetched: 2 row(s)
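-- Note: the table directory now holds both the original bucket files (000000_0, 000001_0)
-- and the ACID delta delta_0000007_0000007_0000. Per this bug, the MAJOR compaction below
-- builds base_0000007 from the delta only and misses the original files, so row one is lost.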
hive> ALTER TABLE t1 COMPACT 'MAJOR';
Compaction enqueued.
OK
Time taken: 0.026 seconds
hive> show compactions;
OK
Database        Table   Partition       Type    State   Worker  Start Time
Time taken: 0.022 seconds, Fetched: 1 row(s)
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/;
Found 3 items
-rwxr-xr-x   1 wzheng staff        112 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000000_0
-rwxr-xr-x   1 wzheng staff        472 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000001_0
drwxr-xr-x   - wzheng staff        204 2015-12-14 15:55 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/base_0000007
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/base_0000007;
Found 2 items
-rw-r--r--   1 wzheng staff        222 2015-12-14 15:55 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/base_0000007/bucket_00000
-rw-r--r--   1 wzheng staff        802 2015-12-14 15:55 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/base_0000007/bucket_00001
hive> select * from t1;
OK
2       Canada  1       maple leaf
Time taken: 0.396 seconds, Fetched: 1 row(s)
hive> select count(*) from t1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = wzheng_20151214155028_630098c6-605f-4e7e-a797-6b49fb48360d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2015-12-14 15:56:20,277 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1720993786_0003
MapReduce Jobs Launched:
Stage-Stage-1: HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1
Time taken: 1.623 seconds, Fetched: 1 row(s)
{code}

Note: the cleanup doesn't kick in because the compaction has already failed. The cleanup itself doesn't have any problem (at least none that we know of for this case).
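For quick reference, the reproduction can be condensed into the script below. This is only a sketch distilled from the transcript above: it assumes the same session settings shown there and that the compactor worker is given time to finish before the final queries.

{code}
-- Minimal reproduction sketch (same settings as the session above).
set hive.support.concurrency=true;
set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
set hive.compactor.initiator.on=true;
set hive.compactor.worker.threads=5;
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE t1 (nationkey INT, name STRING, regionkey INT, comment STRING)
CLUSTERED BY (regionkey) INTO 2 BUCKETS
STORED AS ORC;

-- Row one lands in the original (non-ACID) bucket files 000000_0 / 000001_0.
INSERT INTO TABLE t1 VALUES (1, 'USA', 1, 'united states');

-- Convert the table to ACID.
ALTER TABLE t1 SET TBLPROPERTIES ('transactional' = 'true');

-- Row two lands in a delta directory.
INSERT INTO TABLE t1 VALUES (2, 'Canada', 1, 'maple leaf');

SELECT * FROM t1;                -- both rows returned

ALTER TABLE t1 COMPACT 'MAJOR';
-- wait for the compactor worker to finish, then:

SELECT * FROM t1;                -- only row two returned; row one is lost
SELECT count(*) FROM t1;         -- returns 1
{code}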