I have a query that doesn't use reducers as efficiently as I would hope. If I
run it on a large table, it uses many reducers, even saturating the cluster, as
I desire. However, on smaller tables it uses as few as a single reducer.
While I understand there is a logic to this (not using multiple reducers until
the data size is larger), it is nevertheless inefficient to run a query for
thirty minutes, leaving almost the entire cluster idle, when the query could
distribute the work evenly and wrap things up in a fraction of the time. The
query is shown below (abstracted to its basic form). As you can see, it is a
little atypical: it is a nested query, which implies two map-reduce jobs, and
it uses a script for the reducer stage, which is the stage I am trying to
speed up. I thought the "distribute by" clause would make it use the reducers
more evenly, but as I said, that is not the behavior I am seeing.
Any ideas how I could improve this situation?
Thanks.
CREATE TABLE output_table ROW FORMAT DELIMITED FIELDS TERMINATED BY '|' AS
SELECT * FROM (
    FROM (
        SELECT * FROM input_table
        DISTRIBUTE BY input_column_1
        SORT BY input_column_1 ASC, input_column_2 ASC, input_column_etc ASC
    ) q
    SELECT TRANSFORM(*)
    USING 'python my_reducer_script.py' AS (
        output_column_1,
        output_column_2,
        output_column_etc
    )
) s
ORDER BY output_column_1;
________________________________________________________________________________
Keith Wiley [email protected] keithwiley.com music.keithwiley.com
"Luminous beings are we, not this crude matter."
-- Yoda
________________________________________________________________________________