Re: "order by" and "distinct" in one job?

Michael Howard Mon, 08 Jun 2015 14:29:24 -0700

Rohini wrote:
> Sorry for the confusion.

Thanks for the prompt clarification!
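
For anyone landing on this thread later, here is a minimal sketch of the reordering Rohini suggests in the quoted reply, reusing the relation names from Will's original post. The alias Uniq is introduced here purely for illustration. This does not merge the two operations into a single MapReduce job; it only makes ORDER BY the last step so that the final output stays sorted.

```pig
-- define Components, then
-- Uniq is a hypothetical alias, not part of the original script
Uniq = DISTINCT Components PARALLEL $par;        -- de-duplicate whole records first
Sorted = ORDER Uniq BY block_id PARALLEL $par;   -- sort last so the order is preserved
```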



Michael


On Mon, Jun 8, 2015 at 5:04 PM, Rohini Palaniswamy <[email protected]>
wrote:
>
> Actually ignore what I said as DISTINCT is always on the alias and not on a
> key and what I said will apply only if there is only one field in the
> dataset. Sorry for the confusion. As Mehmet pointed out, your ORDER BY and
> DISTINCT are on different keys. Ordering is done on block_id, but distinct
> is done for the whole record. It is not possible to combine them into one
> mapreduce job as keys are different.
>
> Also in your example, you are trying to do DISTINCT after ORDER BY. Your
> results will not be sorted. Please put the DISTINCT before the ORDER BY.
>
> Regards,
> Rohini
>
> On Mon, Jun 8, 2015 at 1:29 PM, Michael Howard <[email protected]>
> wrote:
>
> > Rohini wrote:
> > > But the current distributed order by uses range
> > > partitioning and same keys can go to different reducers.
> >
> > I am puzzled by this statement.
> > I don't understand how partitioning would work if the same keys go to
> > different reducers.
> >
> > (I suppose I could understand if it were 'two' different reducers, because
> > a specific key may be the max of one range and the min of the next. But
> > even in that case I can't understand why it would be desirable to make
> > both ends of the ranges 'inclusive'. I think I'm heading down the wrong
> > path ... ?)
> >
> > To help educate us newbies, please provide a more detailed explanation ...
> >
> > Q: Is there a URL which could explain "range partitioning ... same keys go
> > to different reducers" ?
> >
> > and/or
> >
> > Q: Can you provide a brief explanation of "range partitioning ... same keys
> > go to different reducers" ?
> >
> >
> > Michael
> >
> >
> > On Mon, Jun 8, 2015 at 4:08 PM, Rohini Palaniswamy <
> > [email protected]>
> > wrote:
> >
> > > If order by and distinct have the same key, it is possible to combine them
> > > into one mapreduce job.  But the current distributed order by uses range
> > > partitioning and same keys can go to different reducers. Tagging along
> > > distinct to that will require more work and is not something we are
> > > planning to do anytime soon.
> > >
> > > On Wed, Jun 3, 2015 at 11:14 AM, Mehmet Tepedelenlioglu <
> > > [email protected]> wrote:
> > >
> > > > Order and distinct are 2 very different operations. You order by
> > > > something, but you take the distinct over all the fields of a relation,
> > > > which is to say that the key/value structure is quite different for the
> > > > general case.
> > > >
> > > >
> > > > > On Jun 3, 2015, at 11:02 AM, <[email protected]> <[email protected]> wrote:
> > > > >
> > > > > Dear Pig users,
> > > > > > Can Pig combine sorting and unique-ing into a single job?  Doing this
> > > > > --define Components, then
> > > > > Sorted_0 = order Components by block_id parallel $par;
> > > > > Sorted = DISTINCT Sorted_0;
> > > > >
> > > > > causes one more MR job to be launched than simply doing this:
> > > > > --define Components, then
> > > > > Sorted = order Components by block_id parallel $par;
> > > > >
> > > > > > It would seem there should be some way to do the distinct in the same
> > > > > > pass as the sort, like 'sort -u'.  But I can't see how. Any tips would be
> > > > > > much appreciated!
> > > > >
> > > > > Thanks,
> > > > > Will
> > > > >
> > > > > William F Dowling
> > > > > Senior Technologist
> > > > > Thomson Reuters
> > > > >
> > > >
> > > >
> > >
> >
