Or setting reducers to 1 and doing a GROUP BY all columns forces a single file too.
Tim, Sent from my iPhone (which makes terrible auto-correct spelling mistakes) > On 21 Nov 2013, at 18:27, Eric Chu <e...@rocketfuel.com> wrote: > > Hi, > > We often have map-only queries that result in a large number of small output > files (in the thousands). Although this doesn't affect CLI, when users try to > view/download the query result in Hue, Hue would time out in trying to read > all these small files. We tried to set the following properties that > supposedly will make Hive launch an extra MR job to merge these files when > the average file size is smaller than some threshold, but it's not working: > hive.merge.mapfiles = true > hive.merge.mapredfiles = true > hive.merge.smallfiles.avgsize = 32000000 (Default is 16000000) > In Hive 10, we used to have hive.mergejob.maponly set to true, but this > property does not exist in Hive 11 and 12. What's the story behind this? > For example, in the following select-from-where query on a partitioned table > in RCFile, there would be two root stages - one doing a scan with filter and > the other doing a fetch. > > Query: > > select data_date as date, ID, if(col_10=1, "yes","no") as answer > from table_1 > where arr[4] <> "0" > and lookup("table_1", x,"action_id")=20519251 > and data_date>=20131014 > > Query Plan: > > STAGE DEPENDENCIES: > Stage-1 is a root stage > Stage-0 is a root stage > > STAGE PLANS: > Stage: Stage-1 > Map Reduce > Alias -> Map Operator Tree: > table_1 > TableScan > alias: table_1 > Filter Operator > predicate: > expr: ((arr[4] <> '0') and (dim_lookup('table_1', x, > 'action_id') = 20519251)) > type: boolean > Select Operator > expressions: > expr: data_date > type: string > expr: ID > type: string > expr: if((col_10= 1), 'yes', 'no') > type: string > outputColumnNames: _col0, _col1, _col2 > File Output Operator > compressed: true > GlobalTableId: 0 > table: > input format: org.apache.hadoop.mapred.TextInputFormat > output format: > org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat > > Stage: Stage-0 > Fetch Operator > limit: -1 > > The query leads to 6253 output files, and the total size is 86427 bytes. Many > of the files have 8 bytes and the ones that have more than 8 bytes usually > have ~30 bytes. With the aforementioned settings, I'd expect an extra MR job > to merge the files, but that didn't happen. > > If anyone has some insights please let me know. > > Thanks, > > Eric