[ 
https://issues.apache.org/jira/browse/HIVE-9398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hao Zhu updated HIVE-9398:
--------------------------
    Affects Version/s: 0.13.0

> Hive did not start small file merge if the source table has .deflate files
> --------------------------------------------------------------------------
>
>                 Key: HIVE-9398
>                 URL: https://issues.apache.org/jira/browse/HIVE-9398
>             Project: Hive
>          Issue Type: Bug
>          Components: Compression
>    Affects Versions: 0.12.0, 0.13.0
>            Reporter: Hao Zhu
>
> My lab Env:
> Hive 0.13 
> If the source table has .deflate compressed files and there is where 
> condition,
> Hive did not start small file merge feature.
> If we have one partition table, and if we run SQL like:
> INSERT OVERWRITE TABLE target
> select xxx from source where...;
> After that, "target" table has many empty files, and the number of files = the
> number of mappers. 
> I can reproduce it in house, and here is minimum reproduce.
> Is it by design or do we need to fix it?
> ----------------------------------------
> ---------------Reproduce----------------
> ----------------------------------------
> 1. Create a source tables -- "source_support" and "source_support2" with the
> same DDL.
> "source_support" is to store normal text files, "source_support2" will have
> .deflate compressed files.
> CREATE  TABLE source_support(
> onecol string 
> )
> PARTITIONED BY (
>   partcol string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
> CREATE  TABLE source_support2(
> onecol string 
> )
> PARTITIONED BY (
>   partcol string)
> ROW FORMAT SERDE
>   'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
> STORED AS INPUTFORMAT
>   'org.apache.hadoop.mapred.TextInputFormat'
> OUTPUTFORMAT
>   'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
> 2. Create a one-row data file:
> # cat /root/hao/000000_0
> 'abc'
> 3. Loading to 3 partitions of "source_support":
> LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
> PARTITION(partcol='2015-01-01');
> LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
> PARTITION(partcol='2015-01-02');
> LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
> PARTITION(partcol='2015-01-03');
> hive> select * from source_support;
> OK
> 'abc'    2015-01-01
> 'abc'    2015-01-02
> 'abc'    2015-01-03
> Time taken: 0.836 seconds, Fetched: 3 row(s)
> 4. Loading to "source_support2" from "source_support" to generate deflate
> files.
> set hive.exec.compress.output=true;
> INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-01')
> select onecol from source_support where
> partcol='2015-01-01';
> set hive.exec.compress.output=true;
> INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-02')
> select onecol from source_support where
> partcol='2015-01-02';
> set hive.exec.compress.output=true;
> INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-03')
> select onecol from source_support where
> partcol='2015-01-03';
> 5. Source has .deflate files even though the small file merge is enabled.
> drop table testbysupport2;
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=true;
> create table testbysupport2 as 
> SELECT 'policy-sale' data_source
> FROM source_support2
> WHERE onecol = '2015.01.04' and partcol in
> ('2015-01-01','2015-01-02','2015-01-03');
> [root@n3a warehouse]# ls -altr testbysupport2
> total 1
> drwxr-xr-x 42 xxx xxx 42 Jan 13 14:34 ..
> -rwxr-xr-x  1 root root  0 Jan 13 14:34 000002_0
> -rwxr-xr-x  1 root root  0 Jan 13 14:34 000001_0
> -rwxr-xr-x  1 root root  0 Jan 13 14:34 000000_0
> drwxr-xr-x  2 root root  3 Jan 13 14:34 .
> 6. If we remove the where condition "onecol = '2015.01.04'",
> small file merge is now enabled.
> drop table testbysupport2;
> set hive.merge.mapfiles=true;
> set hive.merge.mapredfiles=true;
> create table testbysupport2 as 
> SELECT 'policy-sale' data_source
> FROM source_support2
> WHERE  partcol in ('2015-01-01','2015-01-02','2015-01-03');
> [root@n3a warehouse]# ls -altr testbysupport2
> total 2
> drwxr-xr-x 42 xxx xxx 42 Jan 13 14:37 ..
> -rwxr-xr-x  1 root root 36 Jan 13 14:37 000000_0
> drwxr-xr-x  2 root root  1 Jan 13 14:37 .
> ----------------------------------------
> ---------------Reproduce----------------
> ----------------------------------------



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to