Hi,

I'm setting up a Hadoop cluster and would like to understand how much disk
space I should expect to need with joins.

Let's assume that I have 2 tables, each of about 500 GB. Since the tables
are large, these will all be reduce-side joins. As far as I know about such
joins, the data generated is a cross product of the size of the two tables.
Am I wrong?

In other words, for a reduce-side join in Hive involving 2 such tables,
would I need to accommodate for 500 GB * 500 GB = *250000 *GB of *
intermediate* (map-side output) data before the reducer(s) kick-in in my
cluster? Or am I missing something? That seems rediculously high, so I hope
I'm mistaken.

But if the above IS accurate, what are the ways to reduce this consumption
for the same kind of join in Hive?

Thanks,
Safdar

Reply via email to