Hao Zhu created HIVE-9398:
-----------------------------

             Summary: Hive did not start small file merge if the source table 
has .deflate files
                 Key: HIVE-9398
                 URL: https://issues.apache.org/jira/browse/HIVE-9398
             Project: Hive
          Issue Type: Bug
          Components: Compression
    Affects Versions: 0.12.0
            Reporter: Hao Zhu


My lab Env:
Hive 0.13 

If the source table has .deflate compressed files and there is where condition,
Hive did not start small file merge feature.
If we have one partition table, and if we run SQL like:
INSERT OVERWRITE TABLE target
select xxx from source where...;

After that, "target" table has many empty files, and the number of files = the
number of mappers. 

I can reproduce it in house, and here is minimum reproduce.
Is it by design or do we need to fix it?

----------------------------------------
---------------Reproduce----------------
----------------------------------------

1. Create a source tables -- "source_support" and "source_support2" with the
same DDL.
"source_support" is to store normal text files, "source_support2" will have
.deflate compressed files.

CREATE  TABLE source_support(
onecol string 
)
PARTITIONED BY (
  partcol string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

CREATE  TABLE source_support2(
onecol string 
)
PARTITIONED BY (
  partcol string)
ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
STORED AS INPUTFORMAT
  'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

2. Create a one-row data file:
# cat /root/hao/000000_0
'abc'

3. Loading to 3 partitions of "source_support":

LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
PARTITION(partcol='2015-01-01');
LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
PARTITION(partcol='2015-01-02');
LOAD DATA LOCAL INPATH '/root/hao/000000_0' INTO TABLE source_support
PARTITION(partcol='2015-01-03');

hive> select * from source_support;
OK
'abc'    2015-01-01
'abc'    2015-01-02
'abc'    2015-01-03
Time taken: 0.836 seconds, Fetched: 3 row(s)

4. Loading to "source_support2" from "source_support" to generate deflate
files.
set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-01')
select onecol from source_support where
partcol='2015-01-01';

set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-02')
select onecol from source_support where
partcol='2015-01-02';

set hive.exec.compress.output=true;
INSERT OVERWRITE TABLE source_support2 PARTITION (partcol='2015-01-03')
select onecol from source_support where
partcol='2015-01-03';


5. Source has .deflate files even though the small file merge is enabled.

drop table testbysupport2;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
create table testbysupport2 as 
SELECT 'policy-sale' data_source
FROM source_support2
WHERE onecol = '2015.01.04' and partcol in
('2015-01-01','2015-01-02','2015-01-03');

[root@n3a warehouse]# ls -altr testbysupport2
total 1
drwxr-xr-x 42 xxx xxx 42 Jan 13 14:34 ..
-rwxr-xr-x  1 root root  0 Jan 13 14:34 000002_0
-rwxr-xr-x  1 root root  0 Jan 13 14:34 000001_0
-rwxr-xr-x  1 root root  0 Jan 13 14:34 000000_0
drwxr-xr-x  2 root root  3 Jan 13 14:34 .

6. If we remove the where condition "onecol = '2015.01.04'",
small file merge is now enabled.

drop table testbysupport2;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
create table testbysupport2 as 
SELECT 'policy-sale' data_source
FROM source_support2
WHERE  partcol in ('2015-01-01','2015-01-02','2015-01-03');

[root@n3a warehouse]# ls -altr testbysupport2
total 2
drwxr-xr-x 42 xxx xxx 42 Jan 13 14:37 ..
-rwxr-xr-x  1 root root 36 Jan 13 14:37 000000_0
drwxr-xr-x  2 root root  1 Jan 13 14:37 .
----------------------------------------
---------------Reproduce----------------
----------------------------------------



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to