Filter first. Pig will then do it in a single scan, and the join will be faster because it has fewer records to process.
http://pig.apache.org/docs/r0.11.1/perf.html#filter
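
As a rough sketch, your script could be reordered along these lines. The
alias a_filtered and the schema on 'dim' are my own assumptions (your
original loads 'dim' without a schema); everything else is taken from your
sample. I have not run this against your data.

    a = load 'fact' as (dim_key:chararray, fact_value:int);
    -- schema assumed here so that dim_key can be referenced by name
    b = load 'dim' as (dim_key:chararray);

    -- filter the fact data before the join, so fewer records reach it
    a_filtered = filter a by fact_value > 10;

    -- 'replicated' expects the small relation (b) to be listed last
    c = join a_filtered by dim_key, b by dim_key using 'replicated';

    dump c;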

On Tue, May 21, 2013 at 8:28 PM, Thomas Edison wrote:
> Here is a code sample:
>
> a = load 'fact' as (dim_key:chararray, fact_value:int);
> b = load 'dim';
>
> c = join a by dim_key, b by dim_key using 'replicated';
> d = filter c by fact_value > 10;
>
> dump d;
>
> Let's assume both c and d will filter out a lot of records.  Is there a way
> these two steps can be done in one scan of the fact data, rather than two?
> Or is the optimizer smart enough to figure this out and do only one scan?
>
> Thanks.
>
> T.E.
