Thanks Appan for verifying. I will do some more tests on my side too and let you know the results.
I tried a different version of the query where I join'ed two sub-queries for the same partitions and the data comes out to be correct. I will see if I can post the real-world example to the list, because that might sound like a more practical example. If you still have your example(s) do you mind sending me your query-plan for select t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from table_a t1 join table_b t2 on t1.part_col = t2.part_col and t1.common_id = t2.common_id where t1.part_col >= 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string; -Viral On Wed, Jan 19, 2011 at 10:36 AM, Appan Thirumaligai < athirumali...@ngmoco.com> wrote: > Viral, > > I tried the queries below (similar to yours) and I get the expected results > when I do the join. I ran my queries after building hive from the latest > source and hadoop 0.20+. > create table table_a(a_id bigint, common_id bigint, some_string > string,total_count bigint) partitioned by (part_col string) ROW FORMAT > DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS > TEXTFILE; > create table table_b(b_id bigint, common_id bigint, some_string > string,total_count bigint) partitioned by (part_col string) ROW FORMAT > DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n' STORED AS > TEXTFILE; > dfs -mkdir /user/data/table_a; > dfs -mkdir /user/data/table_b; > dfs -put /home/training/hiveug/table_a.csv /user/data/table_a; > dfs -put /home/training/hiveug/table_b.csv /user/data/table_b; > alter table table_a add partition (part_col = 'mypart') location > '/user/data/table_a'; > alter table table_b add partition (part_col = 'mypart') location > '/user/data/table_b'; > select * from table_a t1 join table_b t2 on t1.part_col == t2.part_col; > -->> Returns expected result > select > t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from > table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= > 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_string; > --->>Works fine. > select > t1.some_string,t2.some_string,sum(t1.total_count),sum(t2.total_count) from > table_a t1 join table_b t2 on t1.part_col = t2.part_col where t1.part_col >= > 'mypart' and t2.part_col >= 'mypart' group by t1.some_string,t2.some_st* > from table_a t1 join table_b t2 on t1.part_col = t2.part_col where > t1.part_col >= 'mypart' and t2.part_col >= 'mypart'; > --->Works fine. > > I created the two files with sample data in them and copied it to hdfs > > I'll try later on your hive 0.5.0 but looks like there might be something > wrong in your query. > > On Jan 18, 2011, at 8:40 PM, Ajo Fod wrote: > > Can you try this with a dummy table with very few rows ... to see if > the reason the script doesn't finish is a computational issue? > > One other thing is to try with a combined partition, to see if it is a > problem with the partitioning. > > Also, take a look at the results of an EXPLAIN statement, see if > there are any hints there. > > NOTE: I'm new to hive too. > > -Ajo > > > On Tue, Jan 18, 2011 at 8:08 PM, Viral Bajaria <viral.baja...@gmail.com> > wrote: > > I haven't heard back from any on the list and am still struggling to join > > two tables on partitioned column > > > Has anyone every tried joining two tables on a paritioned column and the > > results are not as expected ? > > On Tue, Jan 18, 2011 at 2:04 AM, Viral Bajaria <viral.baja...@gmail.com> > > wrote: > > > I am facing issues with a query where I am joining two fairly large > tables > > on the partitioned column along with other common columns. The expected > > output is not in line with what I expect it to be. Since the query is > very > > complex, I will simplify it so that people can provide inputs if they > have > > faced similar issues or if I am doing something totally wrong. > > TABLE A: > > a_id bigint > > common_id bigint > > some_string string > > total_count bigint > > part_col string <---- this is the partitioned column > > TABLE B: > > b_int bigint > > common_id bigint > > some_string string > > total_sum bigint > > part_col string <---- this is the partitioned column > > now the query is as follows: > > SELECT /*+ STREAMTABLE(A,B) */ A.some_string, B.some_string, > > sum(A.total_count), sum(B.total_sum) from A JOIN B ON (t1.part_col = > > t2.part_col AND t1.common_id = t2.common_id) WHERE t1.part_col >= 'val1' > AND > > t2.part_col >= 'val1' GROUP BY A.some_string, B.some_string > > Does HIVE not like to join on the partitioned columns ? because when i > > create a join on just the partitioned column the reduce step never > finishes. > > I am using HIVE 0.5.0 > > Thanks, > > Viral > > > > Appan Thirumaligai > ap...@ngmoco.com > Ph:1-818-472-8427 > ngmoco:) > >