Hi list,

I am trying to run a Join query on my 10 node cluster. My query looks as
follows

select * from A JOIN B on (A.a = B.b)

size of A = 15 million rows
size of B = 1 million rows

The problem is A.a and B.b has around 25-30 distinct values per column
which implies that they have high selectivities and the reducers are bulky.

However the performance hit is so horrible that , ALL my reducers hang @
75% for 6 hours and doesn't move further.

The only thing that log shows up is "Join operator - forwarding rows
---------------<Huge number>" kinds of logs for all this long. What does
this mean ?
There is no swapping happening and the CPU % is constantly around 40% for
all this time (observed through Ganglia) .

Any way I can solve this problem? Can anyone help me with this?

Thanks,
jS

Reply via email to