best way to do all this would be run a distinct and group by along side a join (its just a guess but a more detailed approach other guys will suggest )
On Mon, May 6, 2013 at 11:57 PM, Peter Chu <pete....@outlook.com> wrote: > Thanks Nitin and Michael, > > The reason I asked is because I cannot help but wonder if it takes extra > time with all those group by columns. > > Say for example, I have a employees table with 10 columns pertaining to > employees but there could be duplicates, I need to de dup it by performing > a group by employee id, and hire date, but were I want to select the other > fields as well, I would have to add those fields in the group by clause, or > join it back in another table > > When the table is big, I cannot help but wonder if it hurts the > performance. > > Peter > > ------------------------------ > Date: Mon, 6 May 2013 23:42:23 +0530 > Subject: Re: Hive Group By Limitations > From: nitinpawar...@gmail.com > To: user@hive.apache.org > > > hi Peter, > > In hive if you are running a group by, then all the select columns have to > be in the group by clause. This limitation is for the column definition > only and not for the column operations like count etc > > All the columns for group by do go to a single map reduce job and it does > not launch multiple mapreduce jobs for each group by. > > I am not sure what do you mean by better way? > > > > > On Mon, May 6, 2013 at 11:37 PM, Peter Chu <pete....@outlook.com> wrote: > > In Hive, I cannot perform a SELECT GROUP BY on fields not in the GROUP BY > clause. > > Example: SELECT st.a, st.b, st.c, st.d, FROM some_table st GROUP BY st.a; > -- This does not work. > > To make it work, I would need to add the other fields in the group by > clause. > > Not quite sure but I think each group by will give another M/R job. > > Wondering if there is any other way / better way to do group by. > > Peter > > > > > -- > Nitin Pawar > -- Nitin Pawar