Re: query resulting in many small output files causes timeout error in Hue

Tim Thu, 21 Nov 2013 10:56:30 -0800

Alternatively, setting the number of reducers to 1 and doing a GROUP BY on 
all columns forces a single output file too.
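
A minimal sketch of that workaround, reusing the table and column names from 
the quoted query below. It assumes mapred.reduce.tasks is the property that 
controls the reducer count in this Hive version. Note that grouping by every 
selected column also collapses duplicate rows, so this only fits when 
duplicates are not needed in the result:

```sql
-- Assumption: mapred.reduce.tasks sets the number of reducers for the job.
SET mapred.reduce.tasks=1;

-- Grouping by all selected columns adds a reduce phase; with one reducer,
-- the result is written as a single output file.
SELECT data_date, ID, col_10
FROM table_1
WHERE data_date >= 20131014
GROUP BY data_date, ID, col_10;
```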


Tim,
Sent from my iPhone (which makes terrible auto-correct spelling mistakes)

> On 21 Nov 2013, at 18:27, Eric Chu <e...@rocketfuel.com> wrote:
> 
> Hi,
> 
> We often have map-only queries that produce a large number of small output 
> files (in the thousands). Although this doesn't affect the CLI, when users 
> try to view or download the query result in Hue, Hue times out trying to read 
> all of these small files. We tried setting the following properties, which 
> supposedly make Hive launch an extra MR job to merge the files when the 
> average file size is below some threshold, but it's not working:
> hive.merge.mapfiles = true
> hive.merge.mapredfiles = true
> hive.merge.smallfiles.avgsize = 32000000 (Default is 16000000)
> In Hive 0.10, we used to have hive.mergejob.maponly set to true, but this 
> property does not exist in Hive 0.11 and 0.12. What's the story behind this?
> 
> For example, in the following select-from-where query on a partitioned table 
> stored as RCFile, there are two root stages: one doing a scan with a filter 
> and the other doing a fetch.
> 
> Query:
> 
> select data_date as date, ID, if(col_10=1, "yes","no") as answer
> from table_1
> where arr[4] <> "0"
> and lookup("table_1", x,"action_id")=20519251
> and data_date>=20131014
> 
> Query Plan:
> 
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> 
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Alias -> Map Operator Tree:
>         table_1
>           TableScan
>             alias: table_1
>             Filter Operator
>               predicate:
>                   expr: ((arr[4] <> '0') and (dim_lookup('table_1', x, 'action_id') = 20519251))
>                   type: boolean
>               Select Operator
>                 expressions:
>                       expr: data_date
>                       type: string
>                       expr: ID
>                       type: string
>                       expr: if((col_10= 1), 'yes', 'no')
>                       type: string
>                 outputColumnNames: _col0, _col1, _col2
>                 File Output Operator
>                   compressed: true
>                   GlobalTableId: 0
>                   table:
>                       input format: org.apache.hadoop.mapred.TextInputFormat
>                       output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
> 
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> 
> The query produces 6253 output files with a total size of 86427 bytes. Many 
> of the files are 8 bytes, and those larger than 8 bytes are usually ~30 
> bytes. With the aforementioned settings, I'd expect an extra MR job to merge 
> the files, but that didn't happen.
> 
> If anyone has any insight, please let me know.
> 
> Thanks,
> 
> Eric
