Re: Ordered partitioned data

Ahmed Eldawy Mon, 13 May 2013 12:21:59 -0700

Thanks Cheolsoo for your help. I'm still learning Pig and I didn't know
about this nested structure. I'll try it and see how much performance I
gain compared to my naive implementation.


Best regards,
Ahmed Eldawy


On Mon, May 13, 2013 at 12:18 PM, Cheolsoo Park <[email protected]> wrote:

> Hi Ahmed,
>
> Please try this:
>
> grped = GROUP foo BY group_id;
> sorted = FOREACH grped {
>     ordered = ORDER foo BY position;
>     GENERATE group, MyUDF(ordered.name); -- MyUDF concatenates strings in a bag
> };
>
> What this will do is:
> 1) Mappers will send all records with the same key to the same reducer.
> 2) Each reducer will sort only the values for its own keys.
>
> In fact, Pig could optimize this further with its secondary key sort
> optimization (i.e. Pig can remove the ORDER BY in reducers and rely
> entirely on Hadoop secondary sorting instead). But there were some bugs
> with the secondary key sort optimization for this case, and it was
> recently removed from trunk.
>
> Thanks,
> Cheolsoo
>
> On Mon, May 13, 2013 at 7:52 AM, Ahmed Eldawy <[email protected]> wrote:
>
> > Hi,
> >   I have a dataset with three columns: group_id, position, and name. For
> > each group, I need to generate a concatenated string of all names ordered
> > by their position. I can do this by sorting all data by position (or by
> > group_id and position), then grouping by group_id, and finally
> > concatenating the names in each group. I have two questions:
> > 1- Does this really work? In other words, does the GROUP BY operator
> > retain order?
> > 2- What is the most efficient way to do it? Is it better, if possible, to
> > group first and then sort? Say I order by the pair (group_id, position)
> > first; can this be hinted to Pig to make the GROUP BY faster?
> > Thanks for your help.
> >
> >
> > Best regards,
> > Ahmed Eldawy
> >
>
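
For reference, the group-then-sort-within-each-group approach from the quoted reply can be modeled outside Pig. The following is only a minimal Python sketch of the semantics, not of how Pig or Hadoop execute the job; the function name `concat_names_by_group` and the comma separator are hypothetical choices made for illustration, and the join step merely stands in for whatever MyUDF does.

```python
from collections import defaultdict


def concat_names_by_group(rows, sep=","):
    """Model of GROUP ... BY group_id followed by a nested ORDER ... BY position.

    rows: iterable of (group_id, position, name) tuples.
    Returns a dict mapping group_id to the names joined in position order.
    """
    # Step 1: bring all records with the same key together
    # (the role the shuffle to a reducer plays in the quoted explanation).
    groups = defaultdict(list)
    for group_id, position, name in rows:
        groups[group_id].append((position, name))

    # Step 2: sort only within each group, then concatenate the names
    # (the join is an assumed stand-in for MyUDF).
    result = {}
    for group_id, members in groups.items():
        members.sort(key=lambda member: member[0])
        result[group_id] = sep.join(name for _, name in members)
    return result


rows = [(1, 2, "b"), (2, 1, "x"), (1, 1, "a"), (2, 3, "z"), (2, 2, "y")]
grouped = concat_names_by_group(rows)
```

With the sample rows above, `grouped` maps group 1 to "a,b" and group 2 to "x,y,z". Sorting inside each group means no global sort of the whole dataset is needed before grouping.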
