I have data that looks like this:
a e 11 0
b f 2 2
c g 3 3
c h 44 44
c i 75 0
d j 89 0
d k 120 0
d l 3000 0
and I load it like so:
data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
    iid:chararray, num1:int, num2:int);
I want to group by the first column, cid. For each group, if any of the
num2 values (last column) are positive, I want to output every tuple in
that group with an extra field equal to num1. If all the num2 values for
that group are zero, then I want to output every tuple in that group with
an extra field equal to 0.
I figured something like this would work:
data = load 'test.txt' using PigStorage(' ') as (cid:chararray,
    iid:chararray, num1:int, num2:int);
grouped = group data by cid;
results = foreach grouped {
    result1 = SUM(data.num2);
    extended = foreach data generate *, result1 > 0 ? num1 : 0;
    generate FLATTEN(extended);
};
but it does not. I get this error:
2013-01-22 17:15:07,647 [main] ERROR org.apache.pig.tools.grunt.Grunt -
ERROR 1200: <line 98, column 48> mismatched input '>' expecting SEMI_COLON
What is the proper way to do this? From the MapReduce perspective, I group
by the key, and in the reducer, I compute a value for each group, and then
emit every single value for that group along with some extra data.
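To make the intended semantics concrete, here is a rough sketch of that reducer logic in plain Python rather than Pig. The function name extend_groups is made up for illustration, and the rows are the sample data from above; this only pins down the expected result and is not a suggestion for how Pig should execute it.

```python
from itertools import groupby
from operator import itemgetter

def extend_groups(rows):
    """Group rows by cid and append the extra field to every row."""
    out = []
    # groupby only merges adjacent rows, so sort by the key (cid) first.
    # sorted() is stable, so the order within each group is preserved.
    by_cid = sorted(rows, key=itemgetter(0))
    for _, group in groupby(by_cid, key=itemgetter(0)):
        group = list(group)
        # Per-group value: is any num2 in this group positive?
        any_positive = any(num2 > 0 for _, _, _, num2 in group)
        # Emit every row of the group with the extra field appended.
        for cid, iid, num1, num2 in group:
            out.append((cid, iid, num1, num2, num1 if any_positive else 0))
    return out

rows = [
    ("a", "e", 11, 0),
    ("b", "f", 2, 2),
    ("c", "g", 3, 3),
    ("c", "h", 44, 44),
    ("c", "i", 75, 0),
    ("d", "j", 89, 0),
    ("d", "k", 120, 0),
    ("d", "l", 3000, 0),
]

for row in extend_groups(rows):
    print(*row)
```

For the sample data, groups a and d get an extra field of 0 on every row, while the rows in groups b and c get their own num1 (2, and 3, 44, 75 respectively).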
Thanks!
Uri
--
Uri Laserson, PhD
Data Scientist, Cloudera
Twitter/GitHub: @laserson
+1 617 910 0447