Thanks. Setting mapred.reduce.tasks and hive.exec.reducers.max seems to have fixed the problem. The query now saturates the cluster and runs super fast. Excellent!
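
A minimal sketch of how these settings might be applied in a Hive session before running the query. The value 32 is a placeholder for the cluster's reduce-slot count; the thread does not give the values actually used. hive.merge.mapredfiles is only an assumption about which "merge" option is meant in the quoted advice below.

```sql
-- Hypothetical values: replace 32 with the number of reduce slots on your cluster.

-- Option 1: keep Hive's input-size estimate, but cap it at the cluster size.
SET hive.exec.reducers.max=32;

-- Option 2: hard-code the reducer count (the default of -1 means "estimate").
SET mapred.reduce.tasks=32;

-- Assumed merge option: combine small output files at the end of a map-reduce job.
SET hive.merge.mapredfiles=true;
```

With mapred.reduce.tasks set to a fixed number, Hive uses that number instead of estimating from the input size, so on a small table the hard-coded value is what spreads the work across the cluster.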

On Sep 30, 2013, at 12:28 , Sean Busbey wrote:

> Hey Keith,
>
> It sounds like you should tweak the settings for how Hive handles query
> execution[1]:
>
> 1) Tune the guessed number of reducers based on input size
>
> = hive.exec.reducers.bytes.per.reducer
>
> Defaults to 1G. Based on your description, it sounds like this is probably
> still at default.
>
> In this case, you should also set a max # of reducers based on your cluster
> size.
>
> = hive.exec.reducers.max
>
> I usually set this to the # reduce slots, if there's a decent chance I'll get
> to saturate the cluster. If not, don't worry about it.
>
> 2) Hard code a number of reducers
>
> = mapred.reduce.tasks
>
> Setting this will cause Hive to always use that number. It defaults to -1,
> which tells Hive to use the heuristic about input size to guess.
>
> In either of the above cases, you should look at the options to merge small
> files (search for "merge" in the configuration property list) to avoid
> getting lots of little outputs.
>
> HTH
>
> [1]:
> https://cwiki.apache.org/confluence/display/Hive/Configuration+Properties#ConfigurationProperties-QueryExecution
>
> -Sean
>
> On Mon, Sep 30, 2013 at 11:31 AM, Keith Wiley <kwi...@keithwiley.com> wrote:
> I have a query that doesn't use reducers as efficiently as I would hope. If
> I run it on a large table, it uses more reducers, even saturating the
> cluster, as I desire. However, on smaller tables it uses as few as a single
> reducer. While I understand there is a logic in this (not using multiple
> reducers until the data size is larger), it is nevertheless inefficient to
> run a query for thirty minutes leaving the entire cluster vacant when the
> query could distribute the work evenly and wrap things up in a fraction of
> the time. The query is shown below (abstracted to its basic form). As you
> can see, it is a little atypical: it is a nested query which obviously
> implies two map-reduce jobs and it uses a script for the reducer stage that I
> am trying to speed up. I thought the "distribute by" clause should make it
> use the reducers more evenly, but as I said, that is not the behavior I am
> seeing.
>
> Any ideas how I could improve this situation?
>
> Thanks.
>
> CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as
> SELECT * FROM (
>     FROM (
>         SELECT * FROM input_table
>         DISTRIBUTE BY input_column_1 SORT BY input_column_1 ASC,
>         input_column_2 ASC, input_column_etc ASC) q
>     SELECT TRANSFORM(*)
>     USING 'python my_reducer_script.py' AS(
>         output_column_1,
>         output_column_2,
>         output_column_etc
>     )
> ) s
> ORDER BY output_column_1;
>
> ________________________________________________________________________________
> Keith Wiley     kwi...@keithwiley.com     keithwiley.com
> music.keithwiley.com
>
> "Luminous beings are we, not this crude matter."
>                                            -- Yoda
> ________________________________________________________________________________
>
> --
> Sean

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"I do not feel obliged to believe that the same God who has endowed us with
sense, reason, and intellect has intended us to forgo their use."
                                           -- Galileo Galilei
________________________________________________________________________________