Hi Nitin, The ddl is as follows:
CREATE EXTERNAL TABLE user_logs( users_iduuid string, siteid int, site_catid int, stext string, catg int, // CATEGORY scatg int, // SUBCATEGORY catgname string, scatgname string, brand string, // PRODUCT BRAND NAME prrange string, curr int, pname string, // product name pid int, // product ID price string, //Product Price prodnbr int, mrp string, //MRP prURL string, //Product url prIMGURL string, //Product Image URL opr string, oid string, txsucc string, last_updated string //timestamp ) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' I am looking for an output where I have top 10 products from each subcategory(on the basis of count) with all their information like product name, price, url, imgurl. Any there will be multiple entries for the same products (pids) within the same subcategory, In that case I have to pick the product info that is latest(by last_updated field). I have written a query but it is considering a multiple entries of product as different products If price or any other info changes for that product. select siteid,site_catid,catg,scatg,COLLECT_SET(PRODDESC) from ( select PRODDESC,displays,siteid,site_catid,catg,scatg,rank(siteid,site_catid,catg,scatg) as row_number from ( select count(*) as displays,siteid,site_catid,catg,scatg,CONCAT('{','pname:',pname,',price:',price,',','mrp:',mrp,',curr:',curr,',pid:',pid,'}') as PRODDESC from user_logs group by siteid,site_catid,catg,scatg,pid,pname,price,mrp,curr order by siteid,site_catid,catg,scatg,displays desc ) A ) B WHERE row_number < 10 group by siteid,site_catid,catg,scatg order by siteid,site_catid,catg,scatg desc; The rank() method simply helps in fetching top 10 within a subcategory. Every time it encounters the same combination of siteid,site_catid,catg,scatg it increments row_number goes till 10. The problem above is that I am forced to put product info such as "pname,price,mrp,...." in the group by clause otherwise I will not be able to get that information in "select". Therefore, even if someone changes just the price a product(this happens very frequently) it is considered a different product by the above query. And that is something I don't want. I hope I have made it a little more clear? Thanks for your reply :) On Fri, Apr 11, 2014 at 12:45 PM, Nitin Pawar <nitinpawar...@gmail.com>wrote: > may be you can share your table ddl, your query and what output r u > looking for > > > On Fri, Apr 11, 2014 at 12:26 PM, Mohit Durgapal > <durgapalmo...@gmail.com>wrote: > >> I have a hive table partitioned by dates. It contains ecomm data in the >> format siteid,sitecatid,catid,subcatgid,pid,pname,pprice,pmrp,pdesc.... >> >> >> >> What I need to do is to run a query on table above in hive for top 10 >> products(count wise) in each sub category. What adds a bit more complexity >> is that I need all the information of the product. Now when I do group by >> with only subcatg,pid, I can only select the same fields. But I want all >> the data for that product coming in the same row as subcatg & prodid like >> prodname, proddesc,price, mrp,imageurl. And since some information like >> price & proddesc of a product keep on changing I want to pick the latest >> column values(according to a date field) for a pid if we are able to do a >> group by on subcatg,pid. >> >> >> I am not able to find a solution to my problem in hive. Any help would be >> much appreciated. >> >> >> Regards >> Mohit >> > > > > -- > Nitin Pawar >