best way to do all this would be run a distinct and group by along side a
join (its just a  guess but a more detailed approach other guys will
suggest )


On Mon, May 6, 2013 at 11:57 PM, Peter Chu <pete....@outlook.com> wrote:

> Thanks Nitin and Michael,
>
> The reason I asked is because I cannot help but wonder if it takes extra
> time with all those group by columns.
>
> Say for example, I have a employees table with 10 columns pertaining to
> employees but there could be duplicates, I need to de dup it by performing
> a group by employee id, and hire date, but were I want to select the other
> fields as well, I would have to add those fields in the group by clause, or
> join it back in another table
>
> When the table is big, I cannot help but wonder if it hurts the
> performance.
>
> Peter
>
> ------------------------------
> Date: Mon, 6 May 2013 23:42:23 +0530
> Subject: Re: Hive Group By Limitations
> From: nitinpawar...@gmail.com
> To: user@hive.apache.org
>
>
> hi Peter,
>
> In hive if you are running a group by, then all the select columns have to
> be in the group by clause. This limitation is for the column definition
> only and not for the column operations like count etc
>
> All the columns for group by do go to a single map reduce job and it does
> not launch multiple mapreduce jobs for each group by.
>
> I am not sure what do you mean by better way?
>
>
>
>
> On Mon, May 6, 2013 at 11:37 PM, Peter Chu <pete....@outlook.com> wrote:
>
> In Hive, I cannot perform a SELECT GROUP BY on fields not in the GROUP BY
> clause.
>
> Example: SELECT st.a, st.b, st.c, st.d, FROM some_table st GROUP BY st.a;
>  -- This does not work.
>
> To make it work, I would need to add the other fields in the group by
> clause.
>
> Not quite sure but I think each group by will give another M/R job.
>
> Wondering if there is any other way / better way to do group by.
>
> Peter
>
>
>
>
> --
> Nitin Pawar
>



-- 
Nitin Pawar

Reply via email to