Re: "order by" and "distinct" in one job?

Michael Howard Mon, 08 Jun 2015 13:31:16 -0700

Rohini wrote:
> But the current distributed order by uses range
> partitioning and same keys can go to different reducers.

I am puzzled by this statement.
I don't understand how partitioning would work if the same keys go to
different reducers.

(I suppose I could understand if it were 'two' different reducers, because
a specific key may be the max of one range and the min of the next. But
even in that case I can't understand why it would be desirable to make
both ends of the ranges 'inclusive'. I think I'm heading down the wrong
path ... ?)
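
To make my guess concrete, here is a small Python sketch of what I have in
mind. The split points and function names are made up for illustration;
this is not Pig's actual partitioner, just the "inclusive on both ends"
idea versus half-open ranges.

```python
from bisect import bisect_left, bisect_right

# Hypothetical split points between reducer key ranges (not taken from Pig).
# Three reducers: keys up to 10, keys from 10 to 20, keys from 20 upward.
BOUNDARIES = [10, 20]

def reducers_closed(key):
    """Both ends inclusive: a key equal to a split point fits two ranges."""
    lo = bisect_left(BOUNDARIES, key)
    hi = bisect_right(BOUNDARIES, key)
    return list(range(lo, hi + 1))

def reducer_half_open(key):
    """Half-open ranges [low, high): every key fits exactly one range."""
    return bisect_right(BOUNDARIES, key)

if __name__ == "__main__":
    for key in (5, 10, 15, 20, 25):
        print(key, reducers_closed(key), reducer_half_open(key))
```

With inclusive ranges only the split-point keys (10 and 20 here) could land
on two reducers; with half-open ranges no key could. That is why I don't see
how plain range partitioning sends the same key to different reducers.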

To help educate us newbies, please provide a more detailed explanation ...

Q: Is there a URL which could explain "range partitioning ... same keys go
to different reducers" ?

and/or

Q: Can you provide a brief explanation of "range partitioning ... same keys
go to different reducers" ?

Michael

On Mon, Jun 8, 2015 at 4:08 PM, Rohini Palaniswamy <[email protected]>
wrote:

> If order by and distinct have the same key, it is possible to combine them
> into one mapreduce job.  But the current distributed order by uses range
> partitioning and same keys can go to different reducers. Tagging along
> distinct to that will require more work and not something we are planning
> to do sometime soon.
>
> On Wed, Jun 3, 2015 at 11:14 AM, Mehmet Tepedelenlioglu <
> [email protected]> wrote:
>
> > Order and distinct are 2 very different operations. You order by
> > something, but you take the distinct over all the fields of a relation,
> > which is to say that the key/value structure is quite different for the
> > general case.
> >
> >
> > > On Jun 3, 2015, at 11:02 AM, <[email protected]> <
> > [email protected]> wrote:
> > >
> > > Dear Pig users,
> > > Can Pig combine sorting and unique-ing into a single job?  Doing this
> > > --define Components, then
> > > Sorted_0 = order Components by block_id parallel $par;
> > > Sorted = DISTINCT Sorted_0;
> > >
> > > causes one more MR job to be launched than simply doing this:
> > > --define Components, then
> > > Sorted = order Components by block_id parallel $par;
> > >
> > > It would seem there should be some way to do the distinct in the same
> > > pass as the sort, like 'sort -u'.  But I can't see how. Any tips would be
> > > much appreciated!
> > >
> > > Thanks,
> > > Will
> > >
> > > William F Dowling
> > > Senior Technologist
> > > Thomson Reuters
> > >
> >
> >
>
