So you have 50 columns and out of them you want to use 9 columns for finding unique rows?
am i correct in assuming that you want to make a key of combination of these 9 columns so that you have just one row for a single combination of these 9 columns ? On Wed, Oct 2, 2013 at 6:07 AM, Philo Wang <phil...@gmail.com> wrote: > Hi, > > I am using Hive 8.1.8 in EMR. > > We have an extremely large table (~50 columns) where the uniqueness key is > a combination of 9 different columns. I want to filter out any duplicate > rows based on these 9 columns while retaining the ability to select other > columns on an ad hoc basis. I don’t expect rows with the same uniqueness > key to have different data, so I guess this can be generalized to just > filtering out duplicate rows. > > My initial instinct was to do a “select distinct *” on the table and save > the results into another table, but it appears that Hive does not support > “distinct *”. Furthermore, Hive will apply distinct to every column in the > select statement, so something like “select distinct(a), b” does not work > either. > > The only option I could think of from here was to explicitly state all > columns of the table inside the distinct statement, but this seems > unnecessarily messy (again, the table contains more than 50 columns). > > Has anyone ran into a similar issue? Any insight would be appreciated. > > Thanks, > Philo > > -- Nitin Pawar