will it be a good idea to just get top 10 ranked products by whatever your ranking is based on and then join it with its metadata (self join or any other way) ?
On Fri, Apr 11, 2014 at 1:52 PM, Mohit Durgapal <durgapalmo...@gmail.com>wrote: > Hi Nitin, > > The ddl is as follows: > > CREATE EXTERNAL TABLE user_logs( > users_iduuid string, > siteid int, > site_catid int, > stext string, > catg int, // CATEGORY > scatg int, // SUBCATEGORY > catgname string, > scatgname string, > brand string, // PRODUCT BRAND NAME > prrange string, > curr int, > pname string, // product name > pid int, // product ID > price string, //Product Price > prodnbr int, > mrp string, //MRP > prURL string, //Product url > prIMGURL string, //Product Image URL > opr string, > oid string, > txsucc string, > last_updated string //timestamp > ) > > ROW FORMAT DELIMITED > FIELDS TERMINATED BY ',' > > > I am looking for an output where I have top 10 products from each > subcategory(on the basis of count) with all their information like product > name, price, url, imgurl. Any there will be multiple entries for the same > products (pids) within the same subcategory, In that case I have to pick > the product info that is latest(by last_updated field). > > > I have written a query but it is considering a multiple entries of > product as different products If price or any other info changes for that > product. > > > > select siteid,site_catid,catg,scatg,COLLECT_SET(PRODDESC) from > ( > select > PRODDESC,displays,siteid,site_catid,catg,scatg,rank(siteid,site_catid,catg,scatg) > as row_number from > ( > select count(*) as > displays,siteid,site_catid,catg,scatg,CONCAT('{','pname:',pname,',price:',price,',','mrp:',mrp,',curr:',curr,',pid:',pid,'}') > as PRODDESC from > user_logs group by siteid,site_catid,catg,scatg,pid,pname,price,mrp,curr > order by siteid,site_catid,catg,scatg,displays desc > ) A > ) B > WHERE row_number < 10 > group by siteid,site_catid,catg,scatg > order by siteid,site_catid,catg,scatg desc; > > The rank() method simply helps in fetching top 10 within a subcategory. > Every time it encounters the same combination of > siteid,site_catid,catg,scatg it increments row_number goes till 10. > > The problem above is that I am forced to put product info such as > "pname,price,mrp,...." in the group by clause otherwise I will not be able > to get that information in "select". Therefore, even if someone changes > just the price a product(this happens very frequently) it is considered a > different product by the above query. And that is something I don't want. > > I hope I have made it a little more clear? Thanks for your reply :) > > > > On Fri, Apr 11, 2014 at 12:45 PM, Nitin Pawar <nitinpawar...@gmail.com>wrote: > >> may be you can share your table ddl, your query and what output r u >> looking for >> >> >> On Fri, Apr 11, 2014 at 12:26 PM, Mohit Durgapal <durgapalmo...@gmail.com >> > wrote: >> >>> I have a hive table partitioned by dates. It contains ecomm data in the >>> format siteid,sitecatid,catid,subcatgid,pid,pname,pprice,pmrp,pdesc.... >>> >>> >>> >>> What I need to do is to run a query on table above in hive for top 10 >>> products(count wise) in each sub category. What adds a bit more complexity >>> is that I need all the information of the product. Now when I do group by >>> with only subcatg,pid, I can only select the same fields. But I want all >>> the data for that product coming in the same row as subcatg & prodid like >>> prodname, proddesc,price, mrp,imageurl. And since some information like >>> price & proddesc of a product keep on changing I want to pick the latest >>> column values(according to a date field) for a pid if we are able to do a >>> group by on subcatg,pid. >>> >>> >>> I am not able to find a solution to my problem in hive. Any help would >>> be much appreciated. >>> >>> >>> Regards >>> Mohit >>> >> >> >> >> -- >> Nitin Pawar >> > > -- Nitin Pawar