DISTRIBUTE BY question

Connell Donaghy Mon, 13 Jul 2015 13:50:51 -0700

Hey! I'm trying to write a tool that uses a storage handler to write
HFiles with a specific partition function. To do this, I have been using
DISTRIBUTE BY with a UDF that takes the key column and the number of
reducers (which becomes the number of partitions, since each reducer
creates its own HFile). However, I have noticed that sometimes two UDF
values (say 0 and 11) both go to reducer 0, while reducer 11 gets no
input. Could you point me to the place in the source code where
partitioning for the map/reduce job and DISTRIBUTE BY is implemented, so
that I can reverse-engineer it and make sure each key goes to the right
partition? If my question doesn't make sense, just pointing me to where
DISTRIBUTE BY is implemented would be very helpful. Thank you so much for
your time!
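
To make the question concrete, here is a minimal sketch of what I assume
might be going on. It is not taken from the Hive source. It applies the same
arithmetic as Hadoop's stock HashPartitioner (non-negative hash code modulo
the number of reducers) to a UDF result. The class and method names are made
up for illustration. If the DISTRIBUTE BY value is hashed like this rather
than used directly as a reducer index, the reducer chosen depends on the
value's type and hash code, not only on the number itself.

```java
public class HashPartitionSketch {

    // Same arithmetic as Hadoop's HashPartitioner: clear the sign bit of
    // the key's hash code, then take the remainder by the reducer count.
    // Whether Hive hashes DISTRIBUTE BY values the same way is an assumption.
    public static int reducerFor(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    public static void main(String[] args) {
        int numReducers = 12;
        // Compare the reducer picked when the UDF result is an Integer
        // with the one picked when the same number is a String.
        for (int v = 0; v < numReducers; v++) {
            int asInt = reducerFor(Integer.valueOf(v), numReducers);
            int asString = reducerFor(String.valueOf(v), numReducers);
            System.out.println(v + ": Integer -> reducer " + asInt
                    + ", String -> reducer " + asString);
        }
    }
}
```

Under this assumption an Integer result maps to its own index, while the
same number returned as a String can land somewhere else, so distinct UDF
values may share a reducer and leave another one empty.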
