[ https://issues.apache.org/jira/browse/HIVE-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hao Zhu updated HIVE-9398: -------------------------- Affects Version/s: 0.13.0 > Hive did not start small file merge if the source table has .deflate files > -------------------------------------------------------------------------- > > Key: HIVE-9398 > URL: https://issues.apache.org/jira/browse/HIVE-9398 > Project: Hive > Issue Type: Bug > Components: Compression > Affects Versions: 0.12.0, 0.13.0 > Reporter: Hao Zhu > > My lab Env: > Hive 0.13 > If the source table has .deflate compressed files and there is where > condition, > Hive did not start small file merge feature. > If we have one partition table, and if we run SQL like: > INSERT OVERWRITE TABLE target > select xxx from source where...; > After that, "target" table has many empty files, and the number of files = the > number of mappers. > I can reproduce it in house, and here is minimum reproduce. > Is it by design or do we need to fix it? > ---------------------------------------- > ---------------Reproduce---------------- > ---------------------------------------- > 1. Create a source tables -- "source_support" and "source_support2" with the > same DDL. > "source_support" is to store normal text files, "source_support2" will have > .deflate compressed files. > CREATE TABLE source_support( > onecol string > ) > PARTITIONED BY ( > partcol string) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; > CREATE TABLE source_support2( > onecol string > ) > PARTITIONED BY ( > partcol string) > ROW FORMAT SERDE > 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe' > STORED AS INPUTFORMAT > 'org.apache.hadoop.mapred.TextInputFormat' > OUTPUTFORMAT > 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'; > 2. Create a one-row data file: > # cat /root/hao/000000_0 > 'abc' > 3. Loading to 3 partitions of "source_support": > LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support > PARTITION(partcol='2015-01-01'); > LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support > PARTITION(partcol='2015-01-02'); > LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support > PARTITION(partcol='2015-01-03'); > hive> select * from source_support; > OK > 'abc' 2015-01-01 > 'abc' 2015-01-02 > 'abc' 2015-01-03 > Time taken: 0.836 seconds, Fetched: 3 row(s) > 4. Loading to "source_support2" from "source_support" to generate deflate > files. > set hive.exec.compress.output=true; > INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-01') > select onecol from source_support where > partcol='2015-01-01'; > set hive.exec.compress.output=true; > INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-02') > select onecol from source_support where > partcol='2015-01-02'; > set hive.exec.compress.output=true; > INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-03') > select onecol from source_support where > partcol='2015-01-03'; > 5. Source has .deflate files even though the small file merge is enabled. > drop table testbysupport2; > set hive.merge.mapfiles=true; > set hive.merge.mapredfiles=true; > create table testbysupport2 as > SELECT 'policy-sale' data_source > FROM source_support2 > WHERE onecol = '2015.01.04' and partcol in > ('2015-01-01','2015-01-02','2015-01-03'); > [root@n3a warehouse]# ls -altr testbysupport2 > total 1 > drwxr-xr-x 42 xxx xxx 42 Jan 13 14:34 .. > -rwxr-xr-x 1 root root 0 Jan 13 14:34 000002_0 > -rwxr-xr-x 1 root root 0 Jan 13 14:34 000001_0 > -rwxr-xr-x 1 root root 0 Jan 13 14:34 000000_0 > drwxr-xr-x 2 root root 3 Jan 13 14:34 . > 6. If we remove the where condition "onecol = '2015.01.04'", > small file merge is now enabled. > drop table testbysupport2; > set hive.merge.mapfiles=true; > set hive.merge.mapredfiles=true; > create table testbysupport2 as > SELECT 'policy-sale' data_source > FROM source_support2 > WHERE partcol in ('2015-01-01','2015-01-02','2015-01-03'); > [root@n3a warehouse]# ls -altr testbysupport2 > total 2 > drwxr-xr-x 42 xxx xxx 42 Jan 13 14:37 .. > -rwxr-xr-x 1 root root 36 Jan 13 14:37 000000_0 > drwxr-xr-x 2 root root 1 Jan 13 14:37 . > ---------------------------------------- > ---------------Reproduce---------------- > ---------------------------------------- -- This message was sent by Atlassian JIRA (v6.3.4#6332)