Re: "order by" and "distinct" in one job?

Rohini Palaniswamy Mon, 08 Jun 2015 14:07:07 -0700

Actually, please ignore what I said: DISTINCT always applies to the whole
alias, not to a key, so my comment holds only when the dataset has a single
field. Sorry for the confusion. As Mehmet pointed out, your ORDER BY and
DISTINCT are on different keys. Ordering is done on block_id, but DISTINCT
is done on the whole record. Because the keys differ, it is not possible to
combine them into one mapreduce job.


Also, in your example you apply DISTINCT after ORDER BY, so your results
will not be sorted. Please put the DISTINCT before the ORDER BY.
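
A minimal sketch of that reordering, assuming Components is defined as in
your script and $par is passed in as before (the alias name Uniq is only
illustrative):

```pig
-- Components is assumed to be defined earlier, as in the original script.
-- De-duplicate on the whole record first...
Uniq = DISTINCT Components PARALLEL $par;
-- ...then sort the unique records by block_id so the final output is ordered.
Sorted = ORDER Uniq BY block_id PARALLEL $par;
```

This still runs the DISTINCT and the ORDER BY as separate steps, but since
the ORDER BY comes last, Sorted should come out ordered by block_id.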

Regards,
Rohini

On Mon, Jun 8, 2015 at 1:29 PM, Michael Howard <[email protected]>
wrote:

> Rohini wrote:
> > But the current distributed order by uses range
> > partitioning and same keys can go to different reducers.
>
> I am puzzled by this statement.
> I don't understand how partitioning would work if the same keys go to
> different reducers.
>
> (I suppose I could understand if it were 'two' different reducers, because
> a specific key may be the max of one range and the min of the next. But
> even in that case I can't understand why it would be desirable to make
> both ends of the ranges 'inclusive'. I think I'm heading down the wrong
> path ... ?)
>
> To help educate us newbies, please provide a more detailed explanation ...
>
> Q: Is there a URL which could explain "range partitioning ... same keys go
> to different reducers" ?
>
> and/or
>
> Q: Can you provide a brief explanation of "range partitioning ... same keys
> go to different reducers" ?
>
>
> Michael
>
>
> On Mon, Jun 8, 2015 at 4:08 PM, Rohini Palaniswamy <
> [email protected]>
> wrote:
>
> > If order by and distinct have the same key, it is possible to combine
> > them into one mapreduce job.  But the current distributed order by uses
> > range partitioning and same keys can go to different reducers. Tagging
> > distinct onto that would require more work and is not something we are
> > planning to do anytime soon.
> >
> > On Wed, Jun 3, 2015 at 11:14 AM, Mehmet Tepedelenlioglu <
> > [email protected]> wrote:
> >
> > > Order and distinct are 2 very different operations. You order by
> > > something, but you take the distinct over all the fields of a relation,
> > > which is to say that the key/value structure is quite different for the
> > > general case.
> > >
> > >
> > > > On Jun 3, 2015, at 11:02 AM, <[email protected]> <
> > > [email protected]> wrote:
> > > >
> > > > Dear Pig users,
> > > > Can Pig combine sorting and unique-ing into a single job?  Doing this
> > > > --define Components, then
> > > > Sorted_0 = order Components by block_id parallel $par;
> > > > Sorted = DISTINCT Sorted_0;
> > > >
> > > > causes one more MR job to be launched than simply doing this:
> > > > --define Components, then
> > > > Sorted = order Components by block_id parallel $par;
> > > >
> > > > > It would seem there should be some way to do the distinct in the
> > > > > same pass as the sort, like 'sort -u'.  But I can't see how. Any
> > > > > tips would be much appreciated!
> > > >
> > > > Thanks,
> > > > Will
> > > >
> > > > William F Dowling
> > > > Senior Technologist
> > > > Thomson Reuters
> > > >
> > >
> > >
> >
>
