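Filter first. If you apply the filter to the fact data before the join, Pig handles both steps in a single scan, and the join is faster because it has fewer records to process. See http://pig.apache.org/docs/r0.11.1/perf.html#filter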
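
As a rough sketch, your sample could be reordered as below. The schema for 'dim' (including the dim_name field) and the alias a_filtered are assumptions made for illustration; your original loads 'dim' without a schema.

```pig
-- Load the fact data with the schema from the original sample.
a = load 'fact' as (dim_key:chararray, fact_value:int);

-- Assumed schema for 'dim'; the original sample does not declare one.
b = load 'dim' as (dim_key:chararray, dim_name:chararray);

-- Filter the fact data before the join so fewer records reach it.
a_filtered = filter a by fact_value > 10;

-- Replicated join: the relation listed last (b) is the one held in memory.
c = join a_filtered by dim_key, b by dim_key using 'replicated';

dump c;
```

If you want to check how the steps are grouped, replacing `dump c;` with `explain c;` prints the execution plans, which should show whether the filter and the join land in the same map phase.
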
On Tue, May 21, 2013 at 8:28 PM, Thomas Edison <[email protected]> wrote:

> Here is a code sample:
>
> a = load 'fact' as (dim_key:chararray, fact_value:int);
> b = load 'dim';
>
> c = join a by dim_key, b by dim_key using 'replicated';
> d = filter c by fact_value > 10;
>
> dump d;
>
> Let's assume both c and d will filter out a lot of records. Is there a way
> these two step can be done in one scan of the fact data, rather two? Or
> the optimization is smart enough to figure out and only do one scan?
>
> Thanks.
>
> T.E.
