Hengyu Dai created HIVE-18234:
---------------------------------

             Summary: Hive MergeFileTask doesn't work correctly
                 Key: HIVE-18234
                 URL: https://issues.apache.org/jira/browse/HIVE-18234
             Project: Hive
          Issue Type: Bug
          Components: Hive
    Affects Versions: 2.1.1
            Reporter: Hengyu Dai

For MergeFileTask, Hive reads the properties hive.merge.mapfiles, hive.merge.mapredfiles, hive.merge.size.per.task and hive.merge.smallfiles.avgsize to decide whether small files need to be merged. If a merge is needed, Hive generates a MergeFileTask/MapWork to merge the files, and the resulting values are stored in MapWork#maxSplitSize, MapWork#minSplitSize, MapWork#minSplitSizePerNode and MapWork#minSplitSizePerRack.

However, Hive does not use these values when it submits the map task to Hadoop, i.e. the corresponding Hadoop settings
"mapred.max.split.size"
"mapred.min.split.size.per.node"
"mapred.min.split.size.per.rack"
are not set from them. As a result, these Hive settings do not take effect for MergeFileTask.

Steps to reproduce: the following SQL still produces many small files (less than 20MB each):
{code:sql}
set hive.merge.mapredfiles=true;
set hive.merge.mapfiles=true;
set hive.merge.smallfiles.avgsize=500000000;
set hive.merge.size.per.task=1000000000;
insert overwrite table foo partition(dt='20171203') select * from bar;
{code}

To fix this problem, I think we should pass these properties to Hadoop in MergeFileTask. The following code works for me:
{code:java}
// in MergeFileTask#execute()
job.setInputFormat(work.getInputformatClass());
job.setOutputFormat(HiveOutputFormatImpl.class);
job.setMapperClass(MergeFileMapper.class);
job.setMapOutputKeyClass(NullWritable.class);
job.setMapOutputValueClass(NullWritable.class);
job.setOutputKeyClass(NullWritable.class);
job.setOutputValueClass(NullWritable.class);
job.setNumReduceTasks(0);
// propagate the split sizes computed for the merge work to Hadoop
job.setLong("mapred.max.split.size", work.getMaxSplitSize());
job.setLong("mapred.min.split.size.per.rack", work.getMinSplitSizePerRack());
job.setLong("mapred.min.split.size.per.node", work.getMinSplitSizePerNode());
{code}
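
The mapping proposed above can also be sketched in isolation, without Hive or Hadoop on the classpath. This is only an illustration under stated assumptions: the class `MergeSplitSettingsSketch` and its methods are hypothetical, a plain `Map` stands in for the Hadoop job configuration, the numeric values are arbitrary, and skipping non-positive values is an extra guard that is not part of the patch in this report.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-alone sketch; not part of Hive or Hadoop.
public class MergeSplitSettingsSketch {

    // Hadoop property names quoted in this report.
    static final String MAX_SPLIT = "mapred.max.split.size";
    static final String MIN_SPLIT_PER_NODE = "mapred.min.split.size.per.node";
    static final String MIN_SPLIT_PER_RACK = "mapred.min.split.size.per.rack";

    // Builds the Hadoop settings from the three values that the report
    // reads from the merge work (max split size, min per node, min per rack).
    static Map<String, Long> splitSettings(long maxSplitSize,
                                           long minSplitSizePerNode,
                                           long minSplitSizePerRack) {
        Map<String, Long> settings = new LinkedHashMap<>();
        putIfPositive(settings, MAX_SPLIT, maxSplitSize);
        putIfPositive(settings, MIN_SPLIT_PER_NODE, minSplitSizePerNode);
        putIfPositive(settings, MIN_SPLIT_PER_RACK, minSplitSizePerRack);
        return settings;
    }

    // Assumption: a value <= 0 means "not configured" and is left out,
    // so an existing Hadoop default would not be overwritten.
    private static void putIfPositive(Map<String, Long> target, String key, long value) {
        if (value > 0) {
            target.put(key, value);
        }
    }

    public static void main(String[] args) {
        // Illustrative values only.
        Map<String, Long> settings = splitSettings(1_000_000_000L, 500_000_000L, 500_000_000L);
        settings.forEach((key, value) -> System.out.println(key + "=" + value));
    }
}
```

In the real task, each entry would be applied with `job.setLong(key, value)` as in the snippet above.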