Thank you, Mehmet, that was very helpful.

Best,

Rowan

On Sun, Jul 19, 2015 at 6:33 PM, Mehmet Tepedelenlioglu <
[email protected]> wrote:

> First join:
>
> joined = join data1 by int_2, data2 by int_1;
>
> where data1 and data2 are the same data set, loaded under two aliases.
>
> Then group by the first field. The inner bag will have all the connections
> to the 'group', possibly more than once. So you might need a distinct on
> the inner bags as well, if you just want the unique elements.
>
> > On Jul 19, 2015, at 12:49 PM, Rowan <[email protected]> wrote:
> >
> > Hi, I have a question about self-joining two bags. I have some set of
> > numbers that describes connections between the first set of integers and
> > the second set of integers. For example:
> >
> > 1,2
> > 3,4
> > 5,6
> > 5,7
> > 6,8
> >
> > I then load my data as follows, and group it:
> >
> > data = load 'data.csv' using PigStorage(',') as (integer_1:int, integer_2:int);
> > grouped = group data by integer_1;
> >
> > grouped_numbers = foreach grouped generate group as node,
> > data.integer_2 as connection;
> >
> > Which then yields a bag with each first integer and its first-degree
> > connections:
> >
> > (1,{(2)})
> > (3,{(4)})
> > (5,{(6),(7)})
> > (6,{(8)})
> >
> > I would then like to do a self-join of the grouped_numbers bag, in order
> > to give the resultant first integer with each of its first- and
> > second-degree connections. In this case, that would be:
> >
> > (1,{(2)})
> > (3,{(4)})
> > (5,{(6),(7),(8)})
> > (6,{(8)})
> >
> > because 5 is connected to 6, which is connected to 8, so 8 is a
> > second-degree connection of 5. Is there a way to implement this in Pig?
> >
> >
> > Best,
> >
> >
> > Rowan
>
>
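
The join-then-group approach suggested above can be sketched in Pig roughly as follows. This is a minimal sketch, not the poster's exact script: the alias and field names (edges, edges2, src, dst, node, connection) are assumptions made for illustration, and the input is assumed to be the comma-separated 'data.csv' from the question. Because an inner join yields only the second-degree pairs, the sketch unions them with the original first-degree pairs before grouping.

```pig
-- Load the same file under two aliases so it can be joined with itself.
-- Assumption: 'data.csv' holds comma-separated integer pairs.
edges  = LOAD 'data.csv' USING PigStorage(',') AS (src:int, dst:int);
edges2 = LOAD 'data.csv' USING PigStorage(',') AS (src:int, dst:int);

-- First-degree pairs, renamed to a common (node, connection) schema.
first_deg = FOREACH edges GENERATE src AS node, dst AS connection;

-- Second-degree pairs: follow a -> b and then b -> c to get (a, c).
joined     = JOIN edges BY dst, edges2 BY src;
second_deg = FOREACH joined GENERATE edges::src AS node,
                                     edges2::dst AS connection;

-- Combine both degrees and drop pairs that were reached more than once.
all_pairs  = UNION first_deg, second_deg;
uniq_pairs = DISTINCT all_pairs;

-- Collect every connection of each node into one bag.
grouped = GROUP uniq_pairs BY node;
result  = FOREACH grouped GENERATE group AS node,
                                   uniq_pairs.connection AS connections;

DUMP result;
```

For the five sample rows in the question, the only joined row is 5 -> 6 -> 8, so 8 is added to the bag for node 5 and the other nodes keep their first-degree connections. The order of tuples inside each bag is not guaranteed.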
