Sorry, my mistake. I didn't notice that you are using a cross join. Yes, a cross join will always use one reducer; at least, that is my understanding.

Yong
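A common workaround for the single-reducer cross join, sketched here under the assumption that et_keywords is small enough to fit in each mapper's memory, is to broadcast the small table with a MAPJOIN hint so the join runs map-side and no shuffle join (and hence no reducer) is needed for it:

```sql
-- Sketch, untested: force the small side of the cross join to be a map join.
-- et_keywords is assumed small enough to be loaded into mapper memory.
SELECT /*+ MAPJOIN(prep_kw) */ id1, m.keyword, prep_kw.keyword
FROM (SELECT id1, keyword FROM import1) m
CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw;
```

With a map-side join there is no reduce phase for the join at all, so the one-reducer bottleneck disappears; only later stages (such as a GROUP BY) are shuffled.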
Date: Wed, 5 Mar 2014 15:27:48 +0100
Subject: Re: Best way to avoid cross join
From: darkwoll...@gmail.com
To: user@hive.apache.org

Hey Yong,

Even without the GROUP BY (a pure cross join) the query uses only one reducer. Even specifying more reducers doesn't help:

    set mapred.reduce.tasks=50;
    SELECT id1, m.keyword, prep_kw.keyword
    FROM (SELECT id1, keyword FROM import1) m
    CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw;
    ...
    Hadoop job information for Stage-1: number of mappers: 3; number of reducers: 1

What could be set up wrong here? Or can this ugly cross join be avoided altogether? I mean, my original problem is actually something else ;-)

Cheers
Wolli

2014-03-05 15:07 GMT+01:00 java8964 <java8...@hotmail.com>:

Hi, Wolli:

A cross join doesn't mean Hive has to use one reducer. From a query point of view, the following cases will use one reducer:

1) ORDER BY in your query (instead of using SORT BY)
2) Only one reducer group, which means all the data has to be sent to one reducer, as there is only one reducer group. In your case, the distinct count of id1 will be the reducer group count.

Did you explicitly set the reducer count in your Hive session?

Yong

Date: Wed, 5 Mar 2014 14:17:24 +0100
Subject: Best way to avoid cross join
From: darkwoll...@gmail.com
To: user@hive.apache.org

Hey everyone,

Before I write a lot of text, I'll just post something that is already written:
http://www.sqlservercentral.com/Forums/Topic1328496-360-1.aspx

The first post addresses a problem pretty similar to the one I have. Currently my implementation looks like this:

    SELECT id1,
           MAX(CASE
                 WHEN m.keyword IS NULL THEN 0
                 WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
                 ELSE 0
               END) AS flag
    FROM (SELECT id1, keyword FROM import1) m
    CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw
    GROUP BY id1;

Since there is a cross join involved, the execution gets pinned down to only one reducer and takes ages to complete. The thread I posted solves this with some SQL Server-specific tactics.
But I was wondering if anybody has encountered this problem in Hive already and found a better way to solve it. I'm using Hive 0.11 on a MapR distribution, if that is somehow important.

Cheers
Wolli
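One possible direction, sketched under the assumption that et_keywords is a small lookup table that fits in mapper memory: keep the cross join but add a MAPJOIN hint so the join itself runs in the mappers. Only the GROUP BY is then shuffled, and it can use as many reducers as there are distinct id1 groups (or as mapred.reduce.tasks allows):

```sql
-- Sketch, untested: map-side cross join against a small broadcast table,
-- so only the GROUP BY on id1 needs a (parallel) reduce phase.
SELECT /*+ MAPJOIN(prep_kw) */
       id1,
       MAX(CASE
             WHEN m.keyword IS NULL THEN 0
             WHEN instr(m.keyword, prep_kw.keyword) > 0 THEN 1
             ELSE 0
           END) AS flag
FROM (SELECT id1, keyword FROM import1) m
CROSS JOIN (SELECT keyword FROM et_keywords) prep_kw
GROUP BY id1;
```

The trade-off is that every mapper still evaluates every (row, keyword) pair, so this removes the reducer bottleneck but not the quadratic amount of work.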