On Fri, Oct 21, 2011 at 10:21 AM, john smith <js1987.sm...@gmail.com> wrote:
> Hi Edward,
>
> Thanks for replying. I have been using the query
>
> "select a,b from a,b where a.id=b.id ". According to my knowledge of
> Hive, it reads data of both A and B and emits <join_key,rowid/required row
> data> pairs as map outputs and then performs cartesian joins on reduce side
> for the same join_keys .
>
> Is this the cartesian join you are referring to? or Is it the cartesian
> product of the total table (as in sql) ? or Am I missing something?
>
> Can you please throw some light on the functionality of mapred.mode=strict ?
>
> Thanks,
> jS
>
> On Fri, Oct 21, 2011 at 7:29 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>> On Fri, Oct 21, 2011 at 9:22 AM, john smith <js1987.sm...@gmail.com> wrote:
>>
>>> Hi list,
>>>
>>> I am also facing the same problem. My reducers hang at this position and
>>> it takes hours to complete a single reduce task. Can any hive guru help us
>>> out with this issue.
>>>
>>> Thanks,
>>> jS
>>>
>>> 2011/10/21 bangbig <lizhongliangg...@163.com>
>>>
>>>> HI all,
>>>>
>>>> HIVE runs too slowly when it is doing such things(see the log below),
>>>> what's the problem? because I'm joining two large table?
>>>>
>>>> it runs pretty fast at first. when the job finishes 95%, it begins to slow
>>>> down.
>>>>
>>>> --------------------------------------------------
>>>>
>>>> INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1044000000 rows
>>>> 2011-10-21 16:55:57,427 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1045000000 rows
>>>> 2011-10-21 16:55:57,545 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1046000000 rows
>>>> 2011-10-21 16:55:57,686 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1047000000 rows
>>>> 2011-10-21 16:55:57,806 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1048000000 rows
>>>> 2011-10-21 16:55:57,926 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1049000000 rows
>>>> 2011-10-21 16:55:58,045 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1050000000 rows
>>>> 2011-10-21 16:55:58,164 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1051000000 rows
>>>> 2011-10-21 16:55:58,284 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1052000000 rows
>>>> 2011-10-21 16:55:58,405 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1053000000 rows
>>>> 2011-10-21 16:55:58,525 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1054000000 rows
>>>> 2011-10-21 16:55:58,644 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1055000000 rows
>>>> 2011-10-21 16:55:58,764 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1056000000 rows
>>>> 2011-10-21 16:55:58,883 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1057000000 rows
>>>> 2011-10-21 16:55:59,003 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1058000000 rows
>>>> 2011-10-21 16:55:59,122 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1059000000 rows
>>>> 2011-10-21 16:55:59,242 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1060000000 rows
>>>> 2011-10-21 16:55:59,361 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1061000000 rows
>>>> 2011-10-21 16:55:59,482 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1062000000 rows
>>>> 2011-10-21 16:55:59,601 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1063000000 rows
>>>>
>>
>> It is hard to say without seeing the query, the table definition, and the
>> explain. Please send the query. Although I have a theory:
>>
>> This query is not good:
>> select a,b from a,b where a.id=b.id
>> It does a Cart join.
>>
>> This query is better.
>> select a,b from a inner join b on (a.id=b.id)
>>
>> Consider setting in your hive-site.xml
>>
>> hive.mapred.mode=strict
>>
>> It can prevent you from running dangerous queries.
>>
>

To be clear:

Do NOT join this way (it results in a cartesian product):

select a,b from a,b where a.id=b.id

Join this way:

select a,b from a join b on (a.id=b.id)

Also: set hive.mapred.mode=strict in your hive-site.xml to prevent yourself
from mistakenly doing cartesian products and other bad ideas.
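
Since the query, table definitions, and explain output were requested earlier in the thread, here is a minimal HiveQL sketch of how one might collect them. It assumes the tables really are named a and b with a join column id, as in the example query; the real names were never posted. The last query is an extra check that nobody in the thread suggested: it rests on the assumption that a single JoinOperator forwarding over a billion rows may be working through a few heavily repeated join keys, which the log alone does not confirm.

```sql
-- Assumption: tables a and b with join column id, as in the example query.
-- Reject cartesian products for this session
-- (same property as the hive-site.xml setting mentioned above).
SET hive.mapred.mode=strict;

-- Print the plan for the explicit join without running the job.
EXPLAIN
SELECT a.id, b.id
FROM a JOIN b ON (a.id = b.id);

-- Hypothetical skew check: list the ten most frequent join keys in a.
-- A key that appears m times in a and n times in b yields m * n joined rows.
SELECT id, COUNT(*) AS cnt
FROM a
GROUP BY id
ORDER BY cnt DESC
LIMIT 10;
```

Running the same count against b and comparing the top keys would show whether a few ids account for most of the output. Note that strict mode also rejects ORDER BY without LIMIT and queries on partitioned tables that lack a partition filter, so the last query may need a partition predicate if a is partitioned.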