Hi, I'm setting up a Hadoop cluster and would like to understand how much disk space I should expect to need with joins.
Let's assume that I have 2 tables, each of about 500 GB. Since the tables are large, these will all be reduce-side joins. As far as I know about such joins, the data generated is a cross product of the size of the two tables. Am I wrong? In other words, for a reduce-side join in Hive involving 2 such tables, would I need to accommodate for 500 GB * 500 GB = *250000 *GB of * intermediate* (map-side output) data before the reducer(s) kick-in in my cluster? Or am I missing something? That seems rediculously high, so I hope I'm mistaken. But if the above IS accurate, what are the ways to reduce this consumption for the same kind of join in Hive? Thanks, Safdar