Hao Zhu created HIVE-9398:
-----------------------------

             Summary: Hive did not start small file merge if the source table has .deflate files
                 Key: HIVE-9398
                 URL: https://issues.apache.org/jira/browse/HIVE-9398
             Project: Hive
          Issue Type: Bug
          Components: Compression
    Affects Versions: 0.12.0
            Reporter: Hao Zhu

My lab environment: Hive 0.13

If the source table has .deflate compressed files and the query has a WHERE condition, Hive does not start the small file merge.
For example, with a partitioned source table, if we run SQL like:
INSERT OVERWRITE TABLE target select xxx from source where...;
then the "target" table ends up with many empty files, and the number of files equals the number of mappers.

I can reproduce this in house; a minimal reproduction follows.
Is this by design, or do we need to fix it?

----------------------------------------
---------------Reproduce----------------
----------------------------------------
1. Create two source tables, "source_support" and "source_support2", with the same DDL.
"source_support" stores normal text files; "source_support2" will hold .deflate compressed files.

CREATE TABLE source_support(
  onecol string
)
PARTITIONED BY (
  partcol string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

CREATE TABLE source_support2(
  onecol string
)
PARTITIONED BY (
  partcol string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

2. Create a one-row data file:
# cat /root/hao/000000_0
'abc'

3. Load it into 3 partitions of "source_support":
LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support PARTITION(partcol='2015-01-01');
LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support PARTITION(partcol='2015-01-02');
LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support PARTITION(partcol='2015-01-03');

hive> select * from source_support;
OK
'abc'   2015-01-01
'abc'   2015-01-02
'abc'   2015-01-03
Time taken: 0.836 seconds, Fetched: 3 row(s)

4. Load "source_support2" from "source_support" to generate .deflate files:
set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-01') select onecol from source_support where partcol='2015-01-01';
set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-02') select onecol from source_support where partcol='2015-01-02';
set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-03') select onecol from source_support where partcol='2015-01-03';

5. When the source has .deflate files and the query filters on "onecol", the output files are not merged even though small file merge is enabled. The target ends up with three empty files, one per mapper:
drop table testbysupport2;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
create table testbysupport2 as
SELECT 'policy-sale' data_source
FROM source_support2
WHERE onecol = '2015.01.04' and partcol in ('2015-01-01','2015-01-02','2015-01-03');

[root@n3a warehouse]# ls -altr testbysupport2
total 1
drwxr-xr-x 42 xxx  xxx  42 Jan 13 14:34 ..
-rwxr-xr-x  1 root root  0 Jan 13 14:34 000002_0
-rwxr-xr-x  1 root root  0 Jan 13 14:34 000001_0
-rwxr-xr-x  1 root root  0 Jan 13 14:34 000000_0
drwxr-xr-x  2 root root  3 Jan 13 14:34 .

6. If we remove the where condition "onecol = '2015.01.04'", the small file merge runs and the target has a single file:
drop table testbysupport2;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
create table testbysupport2 as
SELECT 'policy-sale' data_source
FROM source_support2
WHERE partcol in ('2015-01-01','2015-01-02','2015-01-03');

[root@n3a warehouse]# ls -altr testbysupport2
total 2
drwxr-xr-x 42 xxx  xxx  42 Jan 13 14:37 ..
-rwxr-xr-x  1 root root 36 Jan 13 14:37 000000_0
drwxr-xr-x  2 root root  1 Jan 13 14:37 .
----------------------------------------
---------------Reproduce----------------
----------------------------------------
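
To compare the results of steps 5 and 6 without reading the ls output by eye, a small shell helper can count the data files in the target directory and how many of them are empty. This is only a sketch: count_files is a hypothetical name, not part of Hive, and it assumes the warehouse directory is visible as a local path, as in the ls listings above.

```shell
# Hypothetical helper (not part of Hive): summarize the files directly
# under a table directory on a locally visible filesystem.
count_files() {
  dir=$1
  # Regular, non-hidden files directly under the directory.
  total=$(find "$dir" -maxdepth 1 -type f ! -name '.*' | wc -l)
  # Zero-byte files among them.
  empty=$(find "$dir" -maxdepth 1 -type f ! -name '.*' -size 0 | wc -l)
  # $(( )) strips any padding that wc adds on some platforms.
  echo "total=$((total)) empty=$((empty))"
}
```

Running "count_files testbysupport2" from the warehouse directory should report total=3 empty=3 for the listing in step 5 and total=1 empty=0 for the listing in step 6.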