It's strange that this is being executed map-side. GROUP is a reduce-side operation (I'm assuming), so the nested FOREACH should run on the reduce side, after the grouping. Have you looked at the MR plan to verify that it really is being executed map-side?
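If you haven't checked yet, EXPLAIN is the quickest way to see where each operator lands. A minimal sketch, assuming the script is loaded in grunt and that notFiltered is the alias from the code quoted in this thread:

```pig
-- Prints the logical, physical and MapReduce plans for the alias.
-- In the MapReduce plan, each job lists its map-side and reduce-side
-- operators separately, so you can see which side the nested
-- CROSS/FILTER is assigned to.
EXPLAIN notFiltered;
```

If the nested operators show up on the reduce side of the GROUP job, then the work is already happening in reducers and the "two mappers" may belong to a different job in the plan.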
One thing to try might be to CROSS first before grouping... although that might be 2 reduce steps.

On Mon, Jan 20, 2014 at 1:27 AM, Serega Sheypak <[email protected]> wrote:

> Hi, I'm in trouble
> Here a part of code:
>
> itemGrp = GROUP itemProj1 BY sale_id PARALLEL 12;
> notFiltered = FOREACH itemGrp{
>     itemProj2 = FOREACH itemProj1
>                 GENERATE FLATTEN(
>                     TOTUPLE(id, other_id)) as
>                     (id, other_id);
>
>     crossed = CROSS itemProj1, itemProj2;
>     filtered = FILTER crossed by (
>         --some cond
>     );
>     projected = FOREACH filtered GENERATE f1, f2, f3;
>     GENERATE FLATTEN(projected) as (f1, f2,f3);
> }
>
> The problem is that all this stuff is executed on map phase. But i want it
> to be executed on reduce phase to get parallelism benfit.
> Now only two mappers (not to much data before CROSS explosion) perform
> cross inside groups and complicated filtering.
>
> I can't find a way to make it run on reduce-phase...
> What do I do wrong?
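
For what it's worth, here is a rough sketch of pairing the rows before (instead of inside) the GROUP. Note that it swaps the literal CROSS for a self-join on sale_id: a plain CROSS followed by a filter on matching sale_id should give the same pairs, but it materializes every combination across all sales first. Everything here is an assumption on my side: that itemProj1 carries sale_id, id and other_id, the rhs_ aliases, the placeholder filter condition, and the projected fields are all made up for illustration.

```pig
-- Hypothetical second copy of the relation, with renamed fields so the
-- two sides of the self-join can be told apart.
itemProj2 = FOREACH itemProj1
            GENERATE sale_id AS rhs_sale_id, id AS rhs_id, other_id AS rhs_other_id;

-- Pair every item with every item of the same sale. This is a reduce-side
-- join, so PARALLEL controls how many reducers do the pairing.
paired = JOIN itemProj1 BY sale_id, itemProj2 BY rhs_sale_id PARALLEL 12;

-- Placeholder condition only; replace with the real filter logic.
filtered = FILTER paired BY (itemProj1::id != itemProj2::rhs_id);

-- Placeholder projection; the real f1, f2, f3 are not known from the thread.
projected = FOREACH filtered
            GENERATE itemProj1::sale_id, itemProj1::id, itemProj2::rhs_id;
```

Whether this ends up cheaper than the nested version depends on how skewed the sales are, so it is worth comparing the plans for both.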
