Hi,

I had initially forwarded this request to the user group but am yet to
receive any response.

I will be grateful if someone can help me out in resolving the issue or
pointing out any mistakes that I may be doing.

It took me around 5 to 6 hours to generate the test data of around 20 GB
(or more) and there must be a better alternative.

Regards,
Gourav

---------- Forwarded message ----------
From: Gourav Sengupta <gourav.had...@gmail.com>
Date: Mon, Jun 10, 2013 at 4:10 PM
Subject: GROUP BY Issue
To: u...@hive.apache.org


Hi,

On running the following query I am getting multiple records with same
value of F1

SELECT F1, COUNT(*)
FROM
(
SELECT F1, F2, COUNT(*)
FROM TABLE1
GROUP BY F1, F2
) a
GROUP BY F1;

As per what I understand there are multiple number of records based on
number of reducers.

Replicating the test scenario:
STEP1: get the dataset as available in
http://snap.stanford.edu/data/amazon0302.html

STEP2: Open the file and delete the heading

STEP3: hadoop fs -mkdir /test

STEP4: hadoop fs -put amazon0302.txt /test

STEP5: create external table test (f1 int, f2 int) row format delimited
fields terminated by '\t' lines terminated by '\n' stored as textfile
location '/test';

STEP6: create table test1 location '/test1' as select left_table.* from
(select * from test where f1<10000) left_table join (select * from test
where f1 < 10000) right_table;

STEP7: hadoop fs -mkdir /test2

STEP8: create table test2 location '/test2' as select f1, count(*) from
(select f1, f2, count(*) from test1 group by f1, f2) a group by f1;

STEP9: select * from test2 where f1 = 9887;

ENVIRONMENT:
HADOOP 2.0.4
HIVE 0.11

Please do let me know whether I am doing anything wrong.


Thanks and Regards,
Gourav Sengupta

Reply via email to