Marco, check out this UDF: http://pig.apache.org/docs/r0.8.1/api/org/apache/pig/piggybank/storage/MultiStorage.html
I think it can get the job done without having to group everything.

Cheers,
Rodrigo

2015-01-08 7:27 GMT-02:00 Marco Cadetg <[email protected]>:

> Hi there,
>
> I've a big pig script which first generates some expensive intermediate
> result on which I run multiple group by statements and multiple stores.
> Something like this.
>
> Register UDFs etc
> A = LOAD....
> B = LOAD....
> C = LOAD....
>
> -- do lots of transformations with A and B and C get intermediate result
> INTER_RES
> result1 = FOREACH (GROUP INTER_RES BY (...
> STORE result1 INTO '....
> result2 = FOREACH (GROUP INTER_RES BY (...
> STORE result2 INTO '....
> result3 = FOREACH (GROUP INTER_RES BY (...
> STORE result3 INTO '....
> result4 = FOREACH (GROUP INTER_RES BY (...
> STORE result4 INTO '....
> ...
> ...
>
> Note the results which get stored are independent of each other. Meaning
> they are not getting used as an input for anything else further down and do
> also not alter the INTER_RES.
>
> Am I correct that pig would only need to LOAD A, B and C once? From what I
> can see on the command line output it looks like the expensive intermediate
> is computed every time for each store. I've done a quick test and if I do a
> STORE of the intermediate and LOAD that it seems to be faster. Is there a
> way to avoid this storing of the expensive intermediate?
>
> Cheers,
> -Marco
>
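
For reference, here is a minimal sketch of how the MultiStorage UDF linked above might be used. The jar path, the input and output paths, and the schema of INTER_RES are all hypothetical placeholders, since the original script elides them. MultiStorage splits rows by the value of one field and does not aggregate, so whether it can stand in for the GROUP BY statements depends on what each FOREACH computes.

```pig
-- Assumption: piggybank.jar lives at this hypothetical path.
REGISTER /path/to/piggybank.jar;

-- Hypothetical stand-in for the expensive intermediate result;
-- the real INTER_RES comes from the transformations on A, B and C.
INTER_RES = LOAD '/tmp/inter_res' USING PigStorage('\t')
            AS (category:chararray, item:chararray, amount:long);

-- MultiStorage(parentPath, splitFieldIndex): writes one subdirectory
-- under parentPath per distinct value of the field at index 0.
STORE INTER_RES INTO '/tmp/out/by_category'
      USING org.apache.pig.piggybank.storage.MultiStorage('/tmp/out/by_category', '0');
```

With this, a single STORE produces one output directory per distinct value of the split field, instead of one STORE statement per result.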
