Sorry, my mistake. I didn't pay attention that you are using cross join.
Yes, cross join will always use one reducer, at least that is my understand.
Yong

Date: Wed, 5 Mar 2014 15:27:48 +0100
Subject: Re: Best way to avoid cross join
From: darkwoll...@gmail.com
To: user@hive.apache.org

hey Yong,
Even without the group by (pure cross join) the query is only using one 
reducer. Even specifying more reducers doesn't help:
set mapred.reduce.tasks=50;

SELECT id1,       m.keyword, 
       prep_kw.keyword
FROM (select id1, keyword from import1) m
CROSS JOIN  (SELECT keyword FROM et_keywords) prep_kw;

...

Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1


What could be setup wrong here? Or can it be avoided to use this ugly cross 
join at all? I mean my original problem is actually something else ;-)

CheersWolli


2014-03-05 15:07 GMT+01:00 java8964 <java8...@hotmail.com>:




Hi, Wolli:
Cross join doesn't mean Hive has to use one reduce.
>From query point of view, the following cases will use one reducer:

1) Order by in your query (Instead of using sort by)2) Only one reducer group, 
which means all the data have to send to one reducer, as there is only one 
reducer group.
In your case, distinct count of id1 will be the reducer group count. Did you 
explicitly set the reducer count in your hive session?

Yong 
Date: Wed, 5 Mar 2014 14:17:24 +0100
Subject: Best way to avoid cross join
From: darkwoll...@gmail.com
To: user@hive.apache.org


Hey everyone,
before i write a lot of text, i just post something which is already 
written:http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx



The first posts adresses a pretty similar problem i also have. Currently my 
implementation looks like this:
SELECT id1,

  MAX(  CASE    WHEN m.keyword IS NULL    THEN 0

    WHEN instr(m.keyword, prep_kw.keyword) > 0    THEN 1

    ELSE 0  END) AS flagFROM (select id1, keyword from import1) m

CROSS JOIN  (SELECT keyword FROM et_keywords) prep_kw

GROUP BY id1;
Since there is a cross join involved, the execution gets pinned down to 1 
reducer only and it takes ages to complete. 


The thread i posted is solving this with some special SQLserver tactics. But I 
was wondering if anybody has encountered the problem in Hive already and found 
a better way to solve this.


I'm using Hive 0.11 on a MapR Distribution, if this is somehow important.
CheersWolli                                       

                                          

Reply via email to