Hello Pig experts,
I have the following simple script. To keep it short, I have replaced my real
UDF with a dummy UDF that reproduces the problem. The UDF, TupleTest, generates
a tuple as follows:
boolean randomboolean = rngen.nextBoolean();
if (randomboolean) {
    output.set(0, 1);
    output.set(1, "Black");
} else {
    output.set(0, 0);
    output.set(1, "White");
}
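In case it helps to reproduce the logic outside Pig, here is a self-contained sketch of the same branching in plain Java. It is not the actual UDF: the class name TupleTestSketch is made up, and a List<Object> stands in for Pig's Tuple so that it runs without the Pig jars.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical stand-in for the TupleTest UDF; not the real Pig EvalFunc.
class TupleTestSketch {
    private final Random rngen;

    TupleTestSketch(Random rngen) {
        this.rngen = rngen;
    }

    // Returns a two-field "tuple": (1, "Black") or (0, "White"), chosen at random.
    List<Object> exec() {
        List<Object> output = new ArrayList<>();
        if (rngen.nextBoolean()) {
            output.add(1);
            output.add("Black");
        } else {
            output.add(0);
            output.add("White");
        }
        return output;
    }

    public static void main(String[] args) {
        TupleTestSketch udf = new TupleTestSketch(new Random());
        // Each call draws a fresh random value, so the printed rows vary per run.
        for (int i = 0; i < 4; i++) {
            System.out.println(udf.exec());
        }
    }
}
```

The input fields (key, value) are ignored here, as the random choice does not depend on them.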
Pig script:
REGISTER /N/u/sameer/software/pig-0.11.1/myudfs.jar
DEFINE SequenceFileLoader org.apache.pig.piggybank.storage.SequenceFileLoader();
A = LOAD '/scratch/file.seq' USING SequenceFileLoader
    AS (key: chararray, value: chararray);
AU = FOREACH A GENERATE FLATTEN(myudfs.TupleTest(key, value))
    AS (randbool: int, randstr: chararray);
STORE AU into '/scratch/AU';
B = GROUP AU BY randbool;
STORE B into '/scratch/B';
X = FOREACH B GENERATE group, COUNT(AU);
DUMP X;
Here is the sample output:
hadoop --config $HADOOP_CONF_DIR fs -cat /scratch/AU/part-m-00000
Warning: $HADOOP_HOME is deprecated.
1 Black
1 Black
0 White
1 Black
hadoop --config $HADOOP_CONF_DIR fs -cat /scratch/B/part-r-00000
Warning: $HADOOP_HOME is deprecated.
0 {(0,White)}
1 {(1,Black),(1,Black),(1,Black)}
DUMP X prints:
(0,2)
(1,2)
As you can see, X is wrong: given the contents of B, it should be (0,1) and
(1,3). Can you please help me with this?
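
To spell out the counts I expect, here is a small plain-Java sketch (no Pig involved) that tallies the randbool column of the four rows stored in /scratch/AU. The class and method names are only illustrative; it simply mirrors what GROUP AU BY randbool followed by COUNT(AU) should produce for that data.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative helper, not part of the Pig script.
class ExpectedCounts {
    // Counts how many rows carry each randbool value, sorted by key.
    static Map<Integer, Long> countByRandbool(List<Integer> randbools) {
        Map<Integer, Long> counts = new TreeMap<>();
        for (int key : randbools) {
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // randbool column of the rows in /scratch/AU/part-m-00000: 1, 1, 0, 1
        List<Integer> au = Arrays.asList(1, 1, 0, 1);
        // Prints (0,1) then (1,3)
        countByRandbool(au).forEach((k, v) -> System.out.println("(" + k + "," + v + ")"));
    }
}
```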