I might be wrong, but I think EMR inserts a reduce job when writing data into S3. At least in my case, I am able to create a single output file, without using any combined input format, by:

SET mapred.reduce.tasks = 1;
INSERT OVERWRITE TABLE price_history_s3 ...
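In case EMR does not add that reduce job for you: on stock Hive a plain INSERT ... SELECT is map-only, but a DISTRIBUTE BY forces a shuffle, and then the single reducer writes exactly one file. A minimal sketch, with made-up source table and column names since I don't know your schema:

SET mapred.reduce.tasks = 1;
-- DISTRIBUTE BY turns the map-only plan into map+reduce;
-- with one reducer, the whole result lands in one output file.
INSERT OVERWRITE TABLE price_history_s3
SELECT sku, price, updated_at
FROM price_history_staging
DISTRIBUTE BY sku;

Everything funnels through one reducer, so this won't scale to huge inputs, but 6k files of ~125 KB is only about 750 MB, which a single reducer can chew through.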
The number of mappers _is_ determined by the number of input files. But I think you can't use a combined input format with Gzip files, because gzip is not splittable. Perhaps you could run a separate query for each partition? (A sketch of that is below the quoted message.)

igor
decide.com

On Tue, Nov 29, 2011 at 11:18 PM, Mohit Gupta <success.mohit.gu...@gmail.com> wrote:

> Hi All,
> I am using Hive 0.7 on Amazon EMR. I need to merge a large number of
> small files into a few larger files (basically merging a number of
> partitions of a table into one). On doing the obvious query, i.e.
> (insert into a new partition select * from all partitions), a large
> number of small files are generated in the new partition (a map-only
> job with the number of output files equal to the number of mappers).
>
> Note: the table being processed here is stored in compressed format on S3:
> set hive.exec.compress.output = true;
> set mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;
> set io.seqfile.compression.type = BLOCK;
>
> I found a couple of solutions on the net, but sadly neither of them
> works for me:
>
> 1. Merging small files
> I set the following parameters:
> set hive.merge.mapfiles=true;
> set hive.merge.size.per.task=256000000;
> set hive.merge.smallfiles.avgsize=100000000;
> set hive.merge.mapredfiles=true;
> set hive.merge.smallfiles.avgsize=1000000000;
> set hive.merge.size.smallfiles.avgsize=1000000000;
>
> Ideally, there should have been a reduce job after the map-only job to
> merge the small output files into a small number of files, but I could
> see no reduce job.
>
> 2. Using CombineHiveInputFormat
> Parameters set:
> set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
> set mapred.min.split.size.per.node=1000000000;
> set mapred.min.split.size.per.rack=1000000000;
> set mapred.max.split.size=1000000000;
>
> Ideally, the number of mappers created here should have been
> considerably less than the number of input files, thereby producing a
> small number of output files equal to the number of mappers. But I
> found the same number of mappers as input files.
>
> ------
> Specifics:
> Approx. size of small files: 125 KB
> No. of small files: >6k
>
> I found a couple of links saying that this merging did not work for
> compressed files but that it is now fixed.
> Any ideas how I can fix this?
>
> Thanks in advance.
>
> --
> Best Regards,
>
> Mohit Gupta
> Software Engineer at Vdopia Inc.
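A sketch of the per-partition idea, assuming a date-style partition column dt and placeholder column names (adjust both to the real schema); you would run it once per source partition, e.g. scripted over the output of SHOW PARTITIONS:

SET mapred.reduce.tasks = 1;
SET hive.exec.compress.output = true;
SET mapred.output.compression.codec = org.apache.hadoop.io.compress.GzipCodec;

-- one run per partition value: the forced single reducer
-- rewrites that partition's ~125 KB files as one gzip file
INSERT OVERWRITE TABLE price_history_s3 PARTITION (dt = '2011-11-01')
SELECT sku, price, updated_at
FROM price_history
WHERE dt = '2011-11-01'
DISTRIBUTE BY sku;

Each query only moves one partition's worth of data, so the single reducer is not a bottleneck, and the output stays gzip-compressed for anything reading it downstream.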