Marco,

Check out this UDF:
http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html

I think it can get the job done without having to group everything.
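
A rough sketch of how MultiStorage might be used, assuming the piggybank jar is available. The jar path, the input and output paths, and the field names below are all hypothetical. It only fits if each output is a split of INTER_RES by the value of one field; it does not replace a GROUP that computes aggregates.

```pig
-- Sketch only: every path and field name here is an assumption.
REGISTER /path/to/piggybank.jar;

-- Stand-in for the expensive intermediate result from the original script.
INTER_RES = LOAD '/tmp/inter_res_input' USING PigStorage('\t')
            AS (key:chararray, value:long);

-- One STORE; MultiStorage writes the records into a separate subdirectory
-- of the parent path for each distinct value of field 0 (key).
STORE INTER_RES INTO '/tmp/inter_res_by_key'
    USING org.apache.pig.piggybank.storage.MultiStorage('/tmp/inter_res_by_key', '0');
```

The second constructor argument is the zero-based index of the field to split on, passed as a string.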

Cheers,
Rodrigo

2015-01-08 7:27 GMT-02:00 Marco Cadetg <[email protected]>:

> Hi there,
>
> I have a big Pig script that first generates an expensive intermediate
> result, on which I then run multiple GROUP BY statements and multiple
> STOREs. Something like this:
>
> Register UDFs etc
> A = LOAD....
> B = LOAD....
> C = LOAD....
>
> -- do lots of transformations with A, B and C to get the intermediate result INTER_RES
> result1 = FOREACH (GROUP INTER_RES BY (...
> STORE result1 INTO '....
> result2 = FOREACH (GROUP INTER_RES BY (...
> STORE result2 INTO '....
> result3 = FOREACH (GROUP INTER_RES BY (...
> STORE result3 INTO '....
> result4 = FOREACH (GROUP INTER_RES BY (...
> STORE result4 INTO '....
> ...
> ...
>
> Note that the stored results are independent of each other: they are not
> used as input for anything further down, and they do not alter
> INTER_RES.
>
> Am I correct that pig would only need to LOAD A, B and C once? From what I
> can see on the command line output it looks like the expensive intermediate
> is computed every time for each store. I've done a quick test and if I do a
> STORE of the intermediate and LOAD that it seems to be faster. Is there a
> way to avoid this storing of the expensive intermediate?
>
> Cheers,
> -Marco
>
