Re: hive runs slowly

Bennie Schut Mon, 24 Oct 2011 00:04:52 -0700

"inner join" is simply translated to "join" they are the same thing
(HIVE-2191)
I'm guessing he means moving the join condition out of the where clause
and using the "select a,b from a join b on (a.id=b.id)" syntax.
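
A minimal sketch of the two spellings, assuming hypothetical tables
a(id, val) and b(id, val). The val column is made up for illustration; only
the join syntax comes from this thread. The INNER form assumes a Hive build
that includes HIVE-2191. Prefixing a query with EXPLAIN prints the plan
without running it, which should let you check that the join is keyed on id:

```sql
-- Hypothetical schema: a(id, val), b(id, val). Only the join syntax is from the thread.

-- Explicit join with the condition in the ON clause.
SELECT a.val, b.val
FROM a JOIN b ON (a.id = b.id);

-- Same query spelled with INNER JOIN (needs a build with HIVE-2191).
SELECT a.val, b.val
FROM a INNER JOIN b ON (a.id = b.id);

-- Print the plan instead of running the query.
EXPLAIN
SELECT a.val, b.val
FROM a JOIN b ON (a.id = b.id);
```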


On 10/22/2011 05:05 AM, john smith wrote:

You mean select a,b from a inner join b on (a.id=b.id)? Or do those brackets
make some difference? I ask because the inner keyword is nowhere mentioned in
the language manual:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins


Any hints?

On Fri, Oct 21, 2011 at 8:47 PM, Edward Capriolo <edlinuxg...@gmail.com> wrote:




    On Fri, Oct 21, 2011 at 10:21 AM, john smith
    <js1987.sm...@gmail.com> wrote:

        Hi Edward,

        Thanks for replying. I have been using the query

        "select a,b from a,b where a.id=b.id". As far as I understand
        Hive, it reads the data of both A and B, emits
        <join_key, rowid/required row data> pairs as map outputs, and
        then performs cartesian joins on the reduce side for rows that
        share the same join_key.

        Is this the cartesian join you are referring to? Or is it the
        cartesian product of the whole tables (as in SQL)? Or am I
        missing something?

        Can you please shed some light on what
        hive.mapred.mode=strict does?

        Thanks,
        jS

        On Fri, Oct 21, 2011 at 7:29 PM, Edward Capriolo
        <edlinuxg...@gmail.com> wrote:



            On Fri, Oct 21, 2011 at 9:22 AM, john smith
            <js1987.sm...@gmail.com> wrote:

                Hi list,

                I am also facing the same problem. My reducers hang at
                this point, and it takes hours to complete a single
                reduce task. Can any Hive guru help us out with this
                issue?

                Thanks,
                jS

                2011/10/21 bangbig <lizhongliangg...@163.com>

                    Hi all,

                    Hive runs too slowly when it is doing things like this (see
                    the log below). What's the problem? Is it because I'm
                    joining two large tables?

                    It runs pretty fast at first. When the job reaches 95%, it
                    begins to slow down.

                    --------------------------------------------------

                    INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1044000000 rows
                    2011-10-21 16:55:57,427 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1045000000 rows
                    2011-10-21 16:55:57,545 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1046000000 rows
                    2011-10-21 16:55:57,686 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1047000000 rows
                    2011-10-21 16:55:57,806 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1048000000 rows
                    2011-10-21 16:55:57,926 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1049000000 rows
                    2011-10-21 16:55:58,045 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1050000000 rows
                    2011-10-21 16:55:58,164 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1051000000 rows
                    2011-10-21 16:55:58,284 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1052000000 rows
                    2011-10-21 16:55:58,405 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1053000000 rows
                    2011-10-21 16:55:58,525 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1054000000 rows
                    2011-10-21 16:55:58,644 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1055000000 rows
                    2011-10-21 16:55:58,764 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1056000000 rows
                    2011-10-21 16:55:58,883 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1057000000 rows
                    2011-10-21 16:55:59,003 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1058000000 rows
                    2011-10-21 16:55:59,122 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1059000000 rows
                    2011-10-21 16:55:59,242 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1060000000 rows
                    2011-10-21 16:55:59,361 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1061000000 rows
                    2011-10-21 16:55:59,482 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1062000000 rows
                    2011-10-21 16:55:59,601 INFO org.apache.hadoop.hive.ql.exec.JoinOperator: 4 forwarding 1063000000 rows





            It is hard to say without seeing the query, the table
            definition, and the EXPLAIN output. Please send the query.
            I do have a theory, though:

            This query is not good:
            select a,b from a,b where a.id=b.id
            It does a cartesian join.

            This query is better:
            select a,b from a inner join b on (a.id=b.id)

            Consider setting in your hive-site.xml

            hive.mapred.mode=strict

            It can prevent you from running dangerous queries.



    To be clear:

    Do NOT join this way (it results in a cartesian product):

    select a,b from a,b where a.id=b.id

    Join this way:

    select a,b from a join b on (a.id=b.id)

    Also:
    Set hive.mapred.mode=strict in your hive-site.xml to prevent
    yourself from mistakenly running cartesian products and other
    dangerous queries.
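
The strict-mode setting can also be tried per session from the Hive CLI
before committing it to hive-site.xml. A hedged sketch, assuming hypothetical
unpartitioned tables a(id, val) and b(id, val); the val column is made up for
illustration. The join with no ON clause is a deliberate cartesian product,
which strict mode is expected to reject:

```sql
-- Session-level equivalent of the hive-site.xml setting.
SET hive.mapred.mode=strict;

-- Show the current value to confirm it took effect.
SET hive.mapred.mode;

-- Deliberate cartesian product (no ON clause); strict mode should refuse to run it.
SELECT a.val, b.val
FROM a JOIN b;

-- Keyed join on id; not a cartesian product, so strict mode should allow it.
SELECT a.val, b.val
FROM a JOIN b ON (a.id = b.id);
```

Strict mode also restricts some other query shapes, so a query that passes
here can still be rejected for a different reason on partitioned tables.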
