I have a query that doesn't use reducers as efficiently as I would hope. If I run it on a large table, it uses more reducers, even saturating the cluster, which is what I want. On smaller tables, however, it uses as few as a single reducer. While I understand there is a logic to this (not using multiple reducers until the data size is larger), it is nevertheless inefficient to run a query for thirty minutes with most of the cluster sitting idle when the query could distribute the work evenly and finish in a fraction of the time. The query is shown below (abstracted to its basic form). As you can see, it is a little atypical: it is a nested query, which obviously implies two map-reduce jobs, and it uses a script for the reducer stage, which is the stage I am trying to speed up. I thought the "distribute by" clause would make it use the reducers more evenly, but as I said, that is not the behavior I am seeing.
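To make the size-based logic mentioned above concrete, here is a minimal sketch of how the reducer count is commonly described as being estimated: input size divided by a bytes-per-reducer setting, capped at a maximum. The function name and the default values are assumptions for illustration only; the corresponding Hive settings are hive.exec.reducers.bytes.per.reducer and hive.exec.reducers.max, and their defaults differ between Hive versions.

```python
def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, max_reducers=999):
    """Rough size-based estimate of how many reducers a job would get.

    Hypothetical helper for illustration; the defaults are assumed values,
    not taken from any particular Hive installation.
    """
    if input_bytes <= 0:
        return 1
    # Ceiling division using integers only, to avoid float rounding.
    wanted = -(-input_bytes // bytes_per_reducer)
    # Never fewer than one reducer, never more than the configured cap.
    return max(1, min(max_reducers, wanted))


if __name__ == "__main__":
    for size in (200_000_000, 50_000_000_000, 10**13):
        print(size, estimate_reducers(size))
```

Under these assumed defaults, any input smaller than the bytes-per-reducer threshold gets exactly one reducer, regardless of how expensive the per-row work in the reducer script is, which would be consistent with the behavior described above.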
Any ideas how I could improve this situation?

Thanks.

CREATE TABLE output_table
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
AS
SELECT * FROM (
    FROM (
        SELECT * FROM input_table
        DISTRIBUTE BY input_column_1
        SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC
    ) q
    SELECT TRANSFORM(*)
    USING 'python my_reducer_script.py'
    AS (
        output_column_1,
        output_column_2,
        output_column_etc
    )
) s
ORDER BY output_column_1;

________________________________________________________________________________
Keith Wiley     kwi...@keithwiley.com     keithwiley.com     music.keithwiley.com

"Luminous beings are we, not this crude matter."
                                           --  Yoda
________________________________________________________________________________