Join is a cross product in Pig and in SQL meaning that each matching pair would be retained. Since you have 2 rows on each side with the same key, you get 4 rows in the output. If you only want 1, you need to run distinct on both relationships first.
On Friday, August 29, 2014 8:59 AM, "Walker, Cameron" <[email protected]> wrote: I am stumped by what, clearly, is my misunderstanding of how PIG does a JOIN. Data in 'data' file: A1 B1 A1 B1 A2 B2 A2 B2 My little program: A = load 'data' using PigStorage(' ') AS (var1:chararray,var2:chararray); B1 = foreach A generate var1,var2; B2 = foreach A generate var1,var2; C = join B1 by (var1,var2), B2 by (var1,var2); dump C; What I expect to get: (A1,B1,A1,B1) (A2,B2,A2,B2) What I actually get: (A1,B1,A1,B1) (A1,B1,A1,B1) (A1,B1,A1,B1) (A1,B1,A1,B1) (A2,B2,A2,B2) (A2,B2,A2,B2) (A2,B2,A2,B2) (A2,B2,A2,B2) I can't understand why I'm getting 4 results instead of 1 for each pair. Clearly it's a 2*2 thing so, if I had 3 of each, I'm sure I'd end up with 9 per set. My question is why the Cartesian join? JOIN must have a subtle difference to it in PIG than regular relational databases. Cameron Walker Americas IT | AIG Property Casualty Senior DBA, Enterprise Application Technology Services 200 South College 13th Floor Charlotte, NC 28202 (Off): 704-338-7423 (Cell): 980-201-0496 mailto:[email protected] | http://www.aig.com
