Thanks.  mapred.reduce.tasks and hive.exec.reducers.max seem to have fixed the 
problem.  It is now saturating the cluster and running the query super fast.  
Excellent!
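
For anyone finding this in the archive, here is a minimal sketch of how the two approaches Sean describes below might look as Hive session settings. The property names are the ones named in this thread, plus hive.merge.mapredfiles as one example of the "merge" options he mentions. Every value (40 reduce slots, 256 MB per reducer) is a made-up placeholder, not a value from the cluster discussed here, so adjust them to your own setup.

```sql
-- Option 1: keep Hive's input-size heuristic, but tune it.
-- 268435456 bytes (256 MB) is an illustrative value only.
SET hive.exec.reducers.bytes.per.reducer=268435456;
-- Cap the guess at the cluster's reduce slots (40 is an assumed number).
SET hive.exec.reducers.max=40;

-- Option 2: bypass the heuristic and hard-code the reducer count.
-- -1 (the default) hands control back to the heuristic.
SET mapred.reduce.tasks=40;

-- In either case, consider merging small output files at the end of
-- a map-reduce job (one of the "merge" properties; check the
-- configuration list for the options available in your Hive version).
SET hive.merge.mapredfiles=true;
```

These settings apply only to the current session, so they can be placed directly above the query they are meant to affect.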

On Sep 30, 2013, at 12:28 , Sean Busbey wrote:

> Hey Keith,
> 
> It sounds like you should tweak the settings for how Hive handles query 
> execution[1]:
> 
> 1) Tune the guessed number of reducers based on input size
> 
> = hive.exec.reducers.bytes.per.reducer
> 
> Defaults to 1G. Based on your description, it sounds like this is probably 
> still at default.
> 
> In this case, you should also set a max # of reducers based on your cluster 
> size.
> 
> = hive.exec.reducers.max
> 
> I usually set this to the # of reduce slots, if there's a decent chance I'll get 
> to saturate the cluster. If not, don't worry about it.
> 
> 2) Hard code a number of reducers
> 
> = mapred.reduce.tasks
> 
> Setting this will cause Hive to always use that number. It defaults to -1, 
> which tells Hive to use its input-size heuristic to guess.
> 
> In either of the above cases, you should look at the options to merge small 
> files (search for "merge"  in the configuration property list) to avoid 
> getting lots of little outputs.
> 
> HTH
> 
> [1]: 
> https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution
> 
> -Sean
> 
> On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley <kwi...@keithwiley.com> wrote:
> I have a query that doesn't use reducers as efficiently as I would hope.  If 
> I run it on a large table, it uses more reducers, even saturating the 
> cluster, as I desire.  However, on smaller tables it uses as few as a single 
> reducer.  While I understand there is a logic in this (not using multiple 
> reducers until the data size is larger), it is nevertheless inefficient to 
> run a query for thirty minutes leaving the entire cluster vacant when the 
> query could distribute the work evenly and wrap things up in a fraction of 
> the time.  The query is shown below (abstracted to its basic form).  As you 
> can see, it is a little atypical: it is a nested query which obviously 
> implies two map-reduce jobs and it uses a script for the reducer stage that I 
> am trying to speed up.  I thought the "distribute by" clause should make it 
> use the reducers more evenly, but as I said, that is not the behavior I am 
> seeing.
> 
> Any ideas how I could improve this situation?
> 
> Thanks.
> 
> CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
> SELECT * FROM (
>         FROM (
>                 SELECT * FROM input_table
>                 DISTRIBUTE BY input_column_1
>                 SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q
>         SELECT TRANSFORM(*)
>         USING 'python my_reducer_script.py' AS(
>         output_column_1,
>         output_column_2,
>         output_column_etc
>         )
> ) s
> ORDER BY output_column_1;
> 
> ________________________________________________________________________________
> Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com
> 
> "Luminous beings are we, not this crude matter."
>                                            --  Yoda
> ________________________________________________________________________________
> 
> 
> 
> 
> -- 
> Sean


________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           --  Galileo Galilei
________________________________________________________________________________
