Thank you Mehmet, that was very helpful.

Best,
Rowan

On Sun, Jul 19, 2015 at 6:33 PM, Mehmet Tepedelenlioglu <[email protected]> wrote:

> First join:
>
>     joined = join data1 by int_2, data2 by int_1;
>
> where data1 and data2 are the same set.
>
> Then group by the first field. The inner bag will have all the connections
> to the 'group', possibly more than once. So you might need a distinct on
> the inner bags as well, if you just want the unique elements.
>
>
> On Jul 19, 2015, at 12:49 PM, Rowan <[email protected]> wrote:
> >
> > Hi, I have a question about self-joining two bags. I have a set of
> > number pairs that describe connections between a first set of integers
> > and a second set of integers. For example:
> >
> > 1,2
> > 3,4
> > 5,6
> > 5,7
> > 6,8
> >
> > I then load my data as follows, and group it:
> >
> > data = load 'data.csv' using PigStorage(',') as (integer_1:int, integer_2:int);
> > grouped = group data by integer_1;
> >
> > grouped_numbers = foreach grouped generate group as node,
> >     data.integer_2 as connection;
> >
> > This yields a bag with each first integer and its first-degree
> > connections:
> >
> > (1,{(2)})
> > (3,{(4)})
> > (5,{(6),(7)})
> > (6,{(8)})
> >
> > I would then like to do a self-join of the grouped_numbers bag, in order
> > to give each first integer together with its first- and second-degree
> > connections. In this case, that would be:
> >
> > (1,{(2)})
> > (3,{(4)})
> > (5,{(6),(7),(8)})
> > (6,{(8)})
> >
> > because 5 is connected to 6, which is connected to 8, so 8 is a
> > second-degree connection of 5. Is there a way to implement this in Pig?
> >
> >
> > Best,
> >
> >
> > Rowan
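
For anyone following the thread, the join, group, and distinct steps suggested in the reply can be sketched in a few lines. This is only an illustration in plain Python, not Pig, so the logic can be checked without a cluster. The function name `first_and_second_degree` is made up for the example. It also makes one assumption the reply does not spell out: the original first-degree pairs are unioned with the join result, since an inner self-join on its own yields only the second-degree pairs.

```python
from collections import defaultdict


def first_and_second_degree(edges):
    """Map each node to its sorted first- and second-degree connections.

    `edges` is a list of (int_1, int_2) pairs, like the rows of data.csv.
    """
    # Self-join: pair every row (a, b) with every row (c, d) where b == c,
    # mirroring "join data1 by int_2, data2 by int_1".
    joined = [(a, b, c, d) for (a, b) in edges for (c, d) in edges if b == c]

    # Keep the first field of the left side and the second field of the
    # right side: these are the second-degree pairs.
    second_degree = {(a, d) for (a, _b, _c, d) in joined}

    # Assumption: union with the original pairs so first-degree connections
    # are kept too. Using sets removes duplicates (the "distinct" step).
    all_pairs = set(edges) | second_degree

    # Group by the first field.
    grouped = defaultdict(set)
    for node, connection in all_pairs:
        grouped[node].add(connection)

    return {node: sorted(conns) for node, conns in grouped.items()}


edges = [(1, 2), (3, 4), (5, 6), (5, 7), (6, 8)]
connections = first_and_second_degree(edges)
for node in sorted(connections):
    print(node, connections[node])
```

In Pig the same shape would presumably be a JOIN, a UNION with the original relation, a DISTINCT, and a GROUP, but that translation is untested here.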
