Hi, I had initially forwarded this request to the user group but am yet to receive any response.
I will be grateful if someone can help me out in resolving the issue or pointing out any mistakes that I may be doing. It took me around 5 to 6 hours to generate the test data of around 20 GB (or more) and there must be a better alternative. Regards, Gourav ---------- Forwarded message ---------- From: Gourav Sengupta <gourav.had...@gmail.com> Date: Mon, Jun 10, 2013 at 4:10 PM Subject: GROUP BY Issue To: u...@hive.apache.org Hi, On running the following query I am getting multiple records with same value of F1 SELECT F1, COUNT(*) FROM ( SELECT F1, F2, COUNT(*) FROM TABLE1 GROUP BY F1, F2 ) a GROUP BY F1; As per what I understand there are multiple number of records based on number of reducers. Replicating the test scenario: STEP1: get the dataset as available in http://snap.stanford.edu/data/amazon0302.html STEP2: Open the file and delete the heading STEP3: hadoop fs -mkdir /test STEP4: hadoop fs -put amazon0302.txt /test STEP5: create external table test (f1 int, f2 int) row format delimited fields terminated by '\t' lines terminated by '\n' stored as textfile location '/test'; STEP6: create table test1 location '/test1' as select left_table.* from (select * from test where f1<10000) left_table join (select * from test where f1 < 10000) right_table; STEP7: hadoop fs -mkdir /test2 STEP8: create table test2 location '/test2' as select f1, count(*) from (select f1, f2, count(*) from test1 group by f1, f2) a group by f1; STEP9: select * from test2 where f1 = 9887; ENVIRONMENT: HADOOP 2.0.4 HIVE 0.11 Please do let me know whether I am doing anything wrong. Thanks and Regards, Gourav Sengupta