On Fri, Oct 21, 2011 at 10:21 AM, john smith <js1987.sm...@gmail.com> wrote:

> Hi Edward,
>
> Thanks for replying. I have been using the query
>
> "select a,b from a,b where a.id=b.id". As I understand Hive, it reads the
> data of both A and B, emits <join_key, rowid/required row data> pairs as
> map output, and then performs a cartesian join on the reduce side among
> rows that share the same join key.
>
> Is this the cartesian join you are referring to? Or is it the cartesian
> product of the whole tables (as in SQL)? Or am I missing something?
>
> Can you please shed some light on what hive.mapred.mode=strict does?
>
> Thanks,
> jS
>
> On Fri, Oct 21, 2011 at 7:29 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:
>
>>
>>
>> On Fri, Oct 21, 2011 at 9:22 AM, john smith <js1987.sm...@gmail.com> wrote:
>>
>>> Hi list,
>>>
>>> I am also facing the same problem. My reducers hang at this point and
>>> it takes hours to complete a single reduce task. Can any Hive guru help us
>>> out with this issue?
>>>
>>> Thanks,
>>> jS
>>>
>>> 2011/10/21 bangbig <lizhongliangg...@163.com>
>>>
>>>> Hi all,
>>>>
>>>> Hive runs too slowly when it reaches this stage (see the log below).
>>>> What's the problem? Is it because I'm joining two large tables?
>>>>
>>>> It runs pretty fast at first. When the job reaches 95%, it begins to slow
>>>> down.
>>>>
>>>> --------------------------------------------------
>>>>
>>>> INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1044000000 
>>>> rows
>>>> 2011-10-21 16:55:57,427 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1045000000 rows
>>>> 2011-10-21 16:55:57,545 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1046000000 rows
>>>> 2011-10-21 16:55:57,686 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1047000000 rows
>>>> 2011-10-21 16:55:57,806 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1048000000 rows
>>>> 2011-10-21 16:55:57,926 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1049000000 rows
>>>> 2011-10-21 16:55:58,045 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1050000000 rows
>>>> 2011-10-21 16:55:58,164 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1051000000 rows
>>>> 2011-10-21 16:55:58,284 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1052000000 rows
>>>> 2011-10-21 16:55:58,405 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1053000000 rows
>>>> 2011-10-21 16:55:58,525 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1054000000 rows
>>>> 2011-10-21 16:55:58,644 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1055000000 rows
>>>> 2011-10-21 16:55:58,764 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1056000000 rows
>>>> 2011-10-21 16:55:58,883 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1057000000 rows
>>>> 2011-10-21 16:55:59,003 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1058000000 rows
>>>> 2011-10-21 16:55:59,122 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1059000000 rows
>>>> 2011-10-21 16:55:59,242 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1060000000 rows
>>>> 2011-10-21 16:55:59,361 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1061000000 rows
>>>> 2011-10-21 16:55:59,482 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1062000000 rows
>>>> 2011-10-21 16:55:59,601 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 
>>>> 4 forwarding 1063000000 rows
>>>>
>>>>
>>>>
>>>>
>>>
>> It is hard to say without seeing the query, the table definitions, and the
>> EXPLAIN output. Please send the query. I do have a theory, though:
>>
>> This query is not good:
>> select a,b from a,b where a.id=b.id
>> It does a cartesian join.
>>
>> This query is better:
>> select a,b from a inner join b on (a.id=b.id)
>>
>> Consider setting this in your hive-site.xml:
>>
>> hive.mapred.mode=strict
>>
>> It can prevent you from running dangerous queries.
>>
>>
>
To be clear:

Do NOT join this way (it results in a cartesian product):

select a,b from a,b where a.id=b.id

Join this way:

select a,b from a join b on (a.id=b.id)

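Also, set hive.mapred.mode=strict in your hive-site.xml. Strict mode makes
Hive reject cartesian products and other risky queries (for example, ORDER BY
without LIMIT, or a scan of a partitioned table with no partition filter), so
you cannot run them by mistake.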
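
If you want to try this without touching hive-site.xml, a minimal sketch is to
set the property for the current session only and look at the plan before
running the job. It assumes the same tables a and b with an id column as in the
queries above; the selected columns are placeholders.

```sql
-- Session-level equivalent of the hive-site.xml setting (this session only).
set hive.mapred.mode=strict;

-- Print the plan without running the job; the join key (id) should show up
-- in the join operator rather than only in a later filter.
-- a.id and b.id are placeholder columns for illustration.
explain
select a.id, b.id
from a join b on (a.id = b.id);
```

With strict mode on, Hive should refuse to compile a join that has no join
condition instead of launching the job.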
