Hengyu Dai created HIVE-18234:
---------------------------------

             Summary: Hive MergeFileTask doesn't work correctly
                 Key: HIVE-18234
                 URL: https://issues.apache.org/jira/browse/HIVE-18234
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.1.1
            Reporter: Hengyu Dai


To decide whether small files need merging, Hive reads the properties 
hive.merge.mapfiles, hive.merge.mapredfiles, hive.merge.size.per.task and 
hive.merge.smallfiles.avgsize. If a merge is needed, Hive generates a 
MergeFileTask/MapWork to merge the files, and the configured size is finally 
stored in MapWork#maxSplitSize, MapWork#minSplitSize, 
MapWork#minSplitSizePerNode and MapWork#minSplitSizePerRack.

But Hive doesn't use these values when it submits the map task to Hadoop, 
i.e., the corresponding Hadoop settings "mapred.max.split.size", 
"mapred.min.split.size.per.node" and "mapred.min.split.size.per.rack" are not 
set from them. So those Hive settings have no effect on MergeFileTask.
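
To make the intended behaviour concrete, here is a minimal sketch of the merge decision, assuming the documented meaning of hive.merge.smallfiles.avgsize (merge when the average output file size is below it). The class and method names are hypothetical and this is a simplified model, not Hive's actual code.

```java
// Simplified model of the merge decision; not Hive's implementation.
// MergeDecision and its methods are hypothetical names for illustration.
final class MergeDecision {
    private MergeDecision() {}

    /** True when the average output file size is below the threshold. */
    static boolean shouldMerge(long totalBytes, int numFiles, long avgSizeThreshold) {
        if (numFiles <= 0) {
            return false; // nothing to merge
        }
        return totalBytes / numFiles < avgSizeThreshold;
    }

    /** Rough estimate of the merged file count if size.per.task were honoured. */
    static long expectedMergedFiles(long totalBytes, long sizePerTask) {
        if (sizePerTask <= 0) {
            throw new IllegalArgumentException("sizePerTask must be positive");
        }
        // Ceiling division, with at least one output file.
        return Math.max(1, (totalBytes + sizePerTask - 1) / sizePerTask);
    }
}
```

For example, with 100 output files of 20MB each and the settings from this report, MergeDecision.shouldMerge(2000000000L, 100, 500000000L) is true and MergeDecision.expectedMergedFiles(2000000000L, 1000000000L) is 2, which is far fewer files than are actually observed.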

Steps to reproduce:
This SQL still produces many small files (each less than 20MB):
{code:sql}
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=1000000000;
insert overwrite table foo partition(dt='20171203')
select * from bar;
{code}

To fix this problem, I think we should pass these properties to Hadoop in 
MergeFileTask. The following code works for me:

{code:java}
      // in MergeFileTask#execute()
      job.setInputFormat(work.getInputformatClass());
      job.setOutputFormat(HiveOutputFormatImpl.class);
      job.setMapperClass(MergeFileMapper.class);
      job.setMapOutputKeyClass(NullWritable.class);
      job.setMapOutputValueClass(NullWritable.class);
      job.setOutputKeyClass(NullWritable.class);
      job.setOutputValueClass(NullWritable.class);
      job.setNumReduceTasks(0);
      // set these properties
      job.setLong("mapred.max.split.size", work.getMaxSplitSize());
      job.setLong("mapred.min.split.size.per.rack",
          work.getMinSplitSizePerRack());
      job.setLong("mapred.min.split.size.per.node",
          work.getMinSplitSizePerNode());
{code}
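
A possible null-safe variant of the same idea is sketched below. It assumes the MapWork getters may return an unset (null) value, which this report does not confirm. A plain Map stands in for the Hadoop JobConf so the sketch runs on its own, and SplitSizeProps is a hypothetical helper, not part of Hive.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Hive. A plain Map stands in for JobConf.
final class SplitSizeProps {
    private SplitSizeProps() {}

    /** Maps the MapWork split sizes to the Hadoop keys named in this report. */
    static Map<String, Long> toHadoopProps(Long maxSplitSize,
                                           Long minSplitSizePerNode,
                                           Long minSplitSizePerRack) {
        Map<String, Long> props = new LinkedHashMap<>();
        putIfSet(props, "mapred.max.split.size", maxSplitSize);
        putIfSet(props, "mapred.min.split.size.per.node", minSplitSizePerNode);
        putIfSet(props, "mapred.min.split.size.per.rack", minSplitSizePerRack);
        return props;
    }

    /** Skips unset values (assumed to be null) so Hadoop defaults stay in place. */
    private static void putIfSet(Map<String, Long> props, String key, Long value) {
        if (value != null) {
            props.put(key, value);
        }
    }
}
```

In MergeFileTask#execute() the resulting entries could then be copied to the job with job.setLong(key, value), which avoids unboxing a value that was never set.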





--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
