[ https://issues.apache.org/jira/browse/HIVE-12724?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15098971#comment-15098971 ]
Eugene Koifman commented on HIVE-12724:
---------------------------------------

Made a Blocker since this is data loss.


ACID: Major compaction fails to include the original bucket files into MR job
------------------------------------------------------------------------------

                 Key: HIVE-12724
                 URL: https://issues.apache.org/jira/browse/HIVE-12724
             Project: Hive
          Issue Type: Bug
          Components: Hive, Transactions
    Affects Versions: 1.3.0, 2.0.0
            Reporter: Wei Zheng
            Assignee: Wei Zheng
            Priority: Blocker
         Attachments: HIVE-12724.1.patch, HIVE-12724.2.patch, HIVE-12724.3.patch, HIVE-12724.4.patch, HIVE-12724.ADDENDUM.1.patch, HIVE-12724.branch-1.2.patch, HIVE-12724.branch-1.patch

How the problem happens:
* Create a non-ACID table
* Before the non-ACID to ACID table conversion, insert row one
* After the conversion, insert row two
* Both rows can be retrieved before MAJOR compaction
* After MAJOR compaction, row one is lost
{code}
hive> USE acidtest;
OK
Time taken: 0.77 seconds
hive> CREATE TABLE t1 (nationkey INT, name STRING, regionkey INT, comment STRING)
    > CLUSTERED BY (regionkey) INTO 2 BUCKETS
    > STORED AS ORC;
OK
Time taken: 0.179 seconds
hive> DESC FORMATTED t1;
OK
# col_name              data_type               comment
nationkey               int
name                    string
regionkey               int
comment                 string

# Detailed Table Information
Database:               acidtest
Owner:                  wzheng
CreateTime:             Mon Dec 14 15:50:40 PST 2015
LastAccessTime:         UNKNOWN
Retention:              0
Location:               file:/Users/wzheng/hivetmp/warehouse/acidtest.db/t1
Table Type:             MANAGED_TABLE
Table Parameters:
        transient_lastDdlTime   1450137040

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            2
Bucket Columns:         [regionkey]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.198 seconds, Fetched: 28 row(s)
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db;
Found 1 items
drwxr-xr-x   - wzheng staff         68 2015-12-14 15:50 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1;
hive> INSERT INTO TABLE t1 VALUES (1, 'USA', 1, 'united states');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = wzheng_20151214155028_630098c6-605f-4e7e-a797-6b49fb48360d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2015-12-14 15:51:58,070 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local73977356_0001
Loading data to table acidtest.t1
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 2.825 seconds
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1;
Found 2 items
-rwxr-xr-x   1 wzheng staff        112 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000000_0
-rwxr-xr-x   1 wzheng staff        472 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000001_0
hive> SELECT * FROM t1;
OK
1       USA     1       united states
Time taken: 0.434 seconds, Fetched: 1 row(s)
hive> ALTER TABLE t1 SET TBLPROPERTIES ('transactional' = 'true');
OK
Time taken: 0.071 seconds
hive> DESC FORMATTED t1;
OK
# col_name              data_type               comment
nationkey               int
name                    string
regionkey               int
comment                 string

# Detailed Table Information
Database:               acidtest
Owner:                  wzheng
CreateTime:             Mon Dec 14 15:50:40 PST 2015
LastAccessTime:         UNKNOWN
Retention:              0
Location:               file:/Users/wzheng/hivetmp/warehouse/acidtest.db/t1
Table Type:             MANAGED_TABLE
Table Parameters:
        COLUMN_STATS_ACCURATE   false
        last_modified_by        wzheng
        last_modified_time      1450137141
        numFiles                2
        numRows                 -1
        rawDataSize             -1
        totalSize               584
        transactional           true
        transient_lastDdlTime   1450137141

# Storage Information
SerDe Library:          org.apache.hadoop.hive.ql.io.orc.OrcSerde
InputFormat:            org.apache.hadoop.hive.ql.io.orc.OrcInputFormat
OutputFormat:           org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat
Compressed:             No
Num Buckets:            2
Bucket Columns:         [regionkey]
Sort Columns:           []
Storage Desc Params:
        serialization.format    1
Time taken: 0.049 seconds, Fetched: 36 row(s)
hive> set hive.support.concurrency=true;
hive> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
hive> set hive.compactor.initiator.on=true;
hive> set hive.compactor.worker.threads=5;
hive> set hive.exec.dynamic.partition.mode=nonstrict;
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1;
Found 2 items
-rwxr-xr-x   1 wzheng staff        112 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000000_0
-rwxr-xr-x   1 wzheng staff        472 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000001_0
hive> INSERT INTO TABLE t1 VALUES (2, 'Canada', 1, 'maple leaf');
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = wzheng_20151214155028_630098c6-605f-4e7e-a797-6b49fb48360d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 2
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2015-12-14 15:54:18,943 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1674014367_0002
Loading data to table acidtest.t1
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
Time taken: 1.995 seconds
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1;
Found 3 items
-rwxr-xr-x   1 wzheng staff        112 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000000_0
-rwxr-xr-x   1 wzheng staff        472 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000001_0
drwxr-xr-x   - wzheng staff        204 2015-12-14 15:54 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/delta_0000007_0000007_0000
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/delta_0000007_0000007_0000;
Found 2 items
-rw-r--r--   1 wzheng staff        214 2015-12-14 15:54 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/delta_0000007_0000007_0000/bucket_00000
-rw-r--r--   1 wzheng staff        797 2015-12-14 15:54 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/delta_0000007_0000007_0000/bucket_00001
hive> SELECT * FROM t1;
OK
1       USA     1       united states
2       Canada  1       maple leaf
Time taken: 0.1 seconds, Fetched: 2 row(s)
hive> ALTER TABLE t1 COMPACT 'MAJOR';
Compaction enqueued.
OK
Time taken: 0.026 seconds
hive> show compactions;
OK
Database        Table   Partition       Type    State   Worker  Start Time
Time taken: 0.022 seconds, Fetched: 1 row(s)
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/;
Found 3 items
-rwxr-xr-x   1 wzheng staff        112 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000000_0
-rwxr-xr-x   1 wzheng staff        472 2015-12-14 15:51 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/000001_0
drwxr-xr-x   - wzheng staff        204 2015-12-14 15:55 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/base_0000007
hive> dfs -ls /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/base_0000007;
Found 2 items
-rw-r--r--   1 wzheng staff        222 2015-12-14 15:55 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/base_0000007/bucket_00000
-rw-r--r--   1 wzheng staff        802 2015-12-14 15:55 /Users/wzheng/hivetmp/warehouse/acidtest.db/t1/base_0000007/bucket_00001
hive> select * from t1;
OK
2       Canada  1       maple leaf
Time taken: 0.396 seconds, Fetched: 1 row(s)
hive> select count(*) from t1;
WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the future versions. Consider using a different execution engine (i.e. tez, spark) or using Hive 1.X releases.
Query ID = wzheng_20151214155028_630098c6-605f-4e7e-a797-6b49fb48360d
Total jobs = 1
Launching Job 1 out of 1
Number of reduce tasks determined at compile time: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapreduce.job.reduces=<number>
Job running in-process (local Hadoop)
2015-12-14 15:56:20,277 Stage-1 map = 100%, reduce = 100%
Ended Job = job_local1720993786_0003
MapReduce Jobs Launched:
Stage-Stage-1:  HDFS Read: 0 HDFS Write: 0 SUCCESS
Total MapReduce CPU Time Spent: 0 msec
OK
1
Time taken: 1.623 seconds, Fetched: 1 row(s)
{code}

Note: the cleanup doesn't kick in here because the compaction itself already fails. The cleanup has no problem of its own (at least none that we know of for this case).
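To make the failure mode concrete, below is a minimal, hypothetical sketch of the input selection a major compaction has to perform on a converted table like t1 above. It is not Hive's actual compactor code; the class and method names are made up, and only the standard Hadoop FileSystem API is used. The point is that a listing which only walks base_*/delta_* directories picks up the delta's bucket_0000N files but silently drops the pre-conversion files 000000_0 and 000001_0 sitting directly in the table directory, which is exactly where row one lives.

{code}
// Hypothetical illustration only -- not Hive's compactor code.
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CompactionInputSketch {

  /** Buggy behaviour sketched: only files under base_* / delta_* are returned. */
  static List<Path> acidDirsOnly(FileSystem fs, Path tableDir) throws IOException {
    List<Path> inputs = new ArrayList<>();
    for (FileStatus s : fs.listStatus(tableDir)) {
      String name = s.getPath().getName();
      if (s.isDirectory() && (name.startsWith("base_") || name.startsWith("delta_"))) {
        for (FileStatus bucket : fs.listStatus(s.getPath())) {
          inputs.add(bucket.getPath());          // bucket_00000, bucket_00001, ...
        }
      }
      // Flat files such as 000000_0 / 000001_0 fall through here and are never
      // handed to the compaction job, so their rows vanish from the new base.
    }
    return inputs;
  }

  /** Expected behaviour sketched: original flat bucket files are included too. */
  static List<Path> withOriginals(FileSystem fs, Path tableDir) throws IOException {
    List<Path> inputs = acidDirsOnly(fs, tableDir);
    for (FileStatus s : fs.listStatus(tableDir)) {
      // Files written before the table became ACID sit directly in the table
      // directory (000000_0, 000001_0); they hold committed rows and must be
      // part of the compaction input as well.
      if (s.isFile()) {
        inputs.add(s.getPath());
      }
    }
    return inputs;
  }

  public static void main(String[] args) throws IOException {
    FileSystem fs = FileSystem.get(new Configuration());
    Path tableDir = new Path("/Users/wzheng/hivetmp/warehouse/acidtest.db/t1");
    System.out.println("without originals: " + acidDirsOnly(fs, tableDir));
    System.out.println("with originals:    " + withOriginals(fs, tableDir));
  }
}
{code}

Run against the pre-compaction layout shown above (000000_0, 000001_0, and delta_0000007_0000007_0000), the first method would feed only the delta's bucket files to the job, which matches the observed outcome: the resulting base_0000007 contains only row two. The second method would also include the two original files, so row one would survive the rewrite.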