I think I figured it out using a replicated join.
My initial understanding of the Pig M/R plan was incorrect. It was
performing a reduce-side join like so:
Map1.1 (LOAD A)
Map1.2 (LOAD B) -> Reduce1 (CROSS, FILTER) -> Map2 (seemingly useless) ->
Reduce2 (COUNT)
Since one of my relations is small enough to fit in memory, I can force Pig
to use a map-side (replicated) join. Now the plan looks like this:
Map(LOAD A, LOAD B, JOIN, FILTER) -> Combine(COUNT) -> Reduce(COUNT)
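For anyone hitting the same problem, here is a rough sketch of one way to express a cross product as a replicated join. It is not necessarily the exact script I ran: the input paths, schemas, the constant join key k, and the filter condition are all hypothetical placeholders. The one fixed rule is that the relation listed last in a replicated join is the one held in memory, so it has to be the small one.

```pig
-- Hypothetical paths and schemas, for illustration only.
A = LOAD 'big_input'   AS (x:chararray, v:long);
B = LOAD 'small_input' AS (y:chararray, w:long);

-- Add the same constant key to both sides, so an equi-join on it
-- yields every (A, B) pair, i.e. the cross product.
A1 = FOREACH A GENERATE x, v, 1 AS k;
B1 = FOREACH B GENERATE y, w, 1 AS k;

-- Replicated join: B1 (listed last) is loaded into memory on each mapper.
C = JOIN A1 BY k, B1 BY k USING 'replicated';

-- Placeholder condition; substitute the real filter here.
F = FILTER C BY A1::v < B1::w;

-- COUNT is algebraic, so Pig can run partial counts in the combiner.
G = GROUP F BY A1::x;
R = FOREACH G GENERATE group, COUNT(F);
```

Since the join and filter both run in the map phase, only the grouped partial counts should cross the shuffle, which matches the single-job plan above.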
On 2/9/14 12:53 PM, "Enns, Steven" <[email protected]> wrote:
>I am trying to aggregate on the cross product of two relations. It can be
>done using a single M/R job but pig is using two. The pig code looks like
>this:
>
> C = cross A, B;
> C = filter C by ...;
> G = group C by x;
> G = foreach G generate group, COUNT(C);
>
>The resulting M/R plan is this:
>
> Map1 (LOAD, CROSS) -> Reduce1 (FILTER) -> Map2 (seemingly useless) ->
>Reduce2 (COUNT)
>
>Of course, the IO between Reduce1 and Map2 is massive. This job can only
>be done efficiently if done like so:
>
> Map1 (LOAD, CROSS) -> Combine1(FILTER, COUNT) -> Reduce1(COUNT)
>
>Is there some way to force pig to use this M/R plan? Or do I have to
>write my own M/R job?
>
>Thanks!