Thanks Cheolsoo for your help. I'm still learning Pig and I didn't know about this nested structure. I'll try it and see how much performance I gain compared to my naive implementation.

Best regards,
Ahmed Eldawy

On Mon, May 13, 2013 at 12:18 PM, Cheolsoo Park <[email protected]> wrote:

> Hi Ahmed,
>
> Please try this:
>
> grped = GROUP foo BY group_id;
> sorted = FOREACH grped {
>     ordered = ORDER foo BY position;
>     GENERATE group, MyUDF(ordered.name); -- MyUDF concatenates strings in a bag
> };
>
> What this will do is:
> 1) Mappers will send the same keys to a reducer.
> 2) Each reducer will sort only the values of its own keys.
>
> In fact, it is possible for Pig to optimize this even further
> using secondary key sort optimization (i.e. Pig can remove ORDER BY in
> reducers and entirely rely on Hadoop secondary sorting instead). But there
> were some bugs with secondary key sort optimization for this case, and it
> was removed from trunk recently.
>
> Thanks,
> Cheolsoo
>
> On Mon, May 13, 2013 at 7:52 AM, Ahmed Eldawy <[email protected]> wrote:
>
> > Hi,
> > I have a dataset with three columns: group_id, position, and name. For
> > each group, I need to generate a concatenated string of all names ordered
> > by their position. I can do this by sorting all data based on position (or
> > group_id and position), then grouping them by group_id, and finally
> > concatenating names in each group. I have two questions here:
> > 1- Does this really work? In other words, does the GROUP BY operator
> > retain order?
> > 2- What is the most efficient way to do it? Is it better, if possible, to
> > group first and then sort? Let's say I order by the pair (group_id,
> > position) first, can this be hinted to Pig to make the group by faster?
> > Thanks for your help
> >
> > Best regards,
> > Ahmed Eldawy
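
For reference, here is a minimal sketch in plain Python (not Pig) of what the group-then-sort plan quoted above computes: rows are first bucketed by group_id, and each bucket is then sorted by position on its own before the names are joined. The function name `concat_names_by_group`, the comma separator, and the sample rows are assumptions made for illustration only. It says nothing about how Pig executes the plan internally.

```python
from collections import defaultdict


def concat_names_by_group(rows, sep=","):
    """Group (group_id, position, name) rows, then sort inside each group.

    Mirrors the shape of GROUP ... BY followed by a nested ORDER ... BY:
    no global sort is needed, only one small sort per group.
    """
    groups = defaultdict(list)
    for group_id, position, name in rows:
        # Analogous to the shuffle: all rows with the same key end up together.
        groups[group_id].append((position, name))

    result = {}
    for group_id, members in groups.items():
        # Analogous to the nested ORDER BY: sort only this group's rows.
        members.sort(key=lambda member: member[0])
        # Stand-in for the concatenating UDF (hypothetical separator).
        result[group_id] = sep.join(name for _, name in members)
    return result


# Hypothetical sample data, deliberately out of order on input.
rows = [(1, 2, "b"), (2, 1, "x"), (1, 1, "a"), (2, 3, "z"), (2, 2, "y")]
print(concat_names_by_group(rows))  # {1: 'a,b', 2: 'x,y,z'}
```

Because the sort happens after grouping, the result does not depend on the order in which rows arrive, so it does not rely on the grouping step preserving any earlier sort order.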
