I have a query that doesn't use reducers as efficiently as I would hope.  When 
I run it on a large table, it uses many reducers, even saturating the cluster, 
as I want.  On smaller tables, however, it uses as few as a single reducer.  
While I understand the logic in this (not using multiple reducers until the 
data size is larger), it is nevertheless inefficient to run a query for thirty 
minutes while the rest of the cluster sits idle, when the query could 
distribute the work evenly and finish in a fraction of the time.  The query is 
shown below (abstracted to its basic form).  As you can see, it is a little 
atypical: it is a nested query, which implies two map-reduce jobs, and it uses 
a script for the reducer stage, which is the stage I am trying to speed up.  I 
thought the "distribute by" clause would make it use the reducers more evenly 
but, as I said, that is not the behavior I am seeing.

Any ideas how I could improve this situation?

Thanks.

CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' as 
SELECT * FROM (
        FROM (
                SELECT * FROM input_table
                DISTRIBUTE BY input_column_1
                SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC) q
        SELECT TRANSFORM(*)
        USING 'python my_reducer_script.py' AS(
        output_column_1,
        output_column_2,
        output_column_etc
        )
) s
ORDER BY output_column_1;
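
For reference, these are the session settings that, as far as I understand, 
influence how many reducers Hive picks.  This is only a sketch: the property 
names are standard Hive/Hadoop settings, but the values are illustrative 
assumptions and have not been tested against this query.

```sql
-- Hive estimates the reducer count from the input size, roughly
-- (input bytes / hive.exec.reducers.bytes.per.reducer), capped by
-- hive.exec.reducers.max.  Lowering bytes-per-reducer should raise the
-- estimate for small tables (64 MB here is an illustrative value).
SET hive.exec.reducers.bytes.per.reducer=67108864;

-- Alternatively, request a fixed reducer count for the session.
-- 32 is a placeholder, not a recommendation.  This is the older Hadoop
-- property name; newer releases call it mapreduce.job.reduces.
SET mapred.reduce.tasks=32;
```

Even with more reducers available, DISTRIBUTE BY input_column_1 sends all rows 
with the same input_column_1 value to the same reducer, so the work can spread 
over no more reducers than that column has distinct values, and the final 
ORDER BY is, as far as I know, still handled by a single reducer.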

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com    music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________
