Re: Map side join

Souvik Banerjee Thu, 27 Dec 2012 15:06:12 -0800

Hi,

To conclude this thread, I am summarizing my experience. Correct me if you
think or have observed otherwise.


1) For a map-side join you need to set the flag hive.auto.convert.join=true;
Map-side join works well with multiple tables and multiple join conditions.
2) You can change the size threshold of the small table according to the RAM
available.
3) If you observe a huge volume expansion during the join operation, the
mappers will take a long time. I observed that mappers don't always report
status, so set the task timeout to a high value so that the framework doesn't
kill the ongoing tasks. The mappers eventually complete and the job ends
successfully.
4) Bringing down the HDFS block size does launch more mappers and is very
helpful in cases where you observe real volume expansion during the join.
But it might cause problems for other queries / Hadoop jobs.
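
For reference, the four points above can be sketched as session settings.
This is only a sketch: apart from hive.auto.convert.join and the two split
size properties (both named earlier in this thread), the property names are
my assumption of what applies on a Hive 0.9 / MRv1 setup, namely
hive.mapjoin.smalltable.filesize for the small-table threshold and
mapred.task.timeout for the task timeout. All values are illustrative, not
recommendations.

```sql
-- 1) Let Hive convert eligible joins to map-side joins automatically.
set hive.auto.convert.join=true;

-- 2) Assumed property: small-table size threshold, in bytes.
--    Illustrative value; size it to the RAM available to the task.
set hive.mapjoin.smalltable.filesize=50000000;

-- 3) Assumed property: task timeout, in milliseconds.
--    Illustrative value (30 minutes) so slow, silent mappers are not killed.
set mapred.task.timeout=1800000;

-- 4) Instead of lowering the HDFS block size, cap the split size
--    (64 MB here) to get more mappers, as Bejoy suggested.
set mapred.min.split.size=67108864;
set mapred.max.split.size=67108864;
```

Settings issued with "set" apply only to the current Hive session, so the
split size route in 4) should not affect other queries or Hadoop jobs the way
a cluster-wide block size change can.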

Thanks and regards,
Souvik.

On Thu, Dec 13, 2012 at 12:36 PM, Souvik Banerjee
<souvikbaner...@gmail.com> wrote:

> Thanks for the help.
> What I did earlier was change the configuration in HDFS and create the
> table. I expected the block size of the new table to be 32 MB.
> But I found that when using Cloudera Manager you need to deploy the
> configuration change for both HDFS and MapReduce. (I did it only for HDFS.)
> Now I have deleted the old table and recreated it, and I could launch
> more mappers.
> Thanks a lot once again. I will post what happens with more mappers.
>
> Thanks and regards,
> Souvik.
>
>
> On Thu, Dec 13, 2012 at 12:06 PM, <bejoy...@yahoo.com> wrote:
>
>> Hi Souvik
>>
>> To have the new HDFS block size take effect on already existing files,
>> you need to re-copy them into HDFS.
>>
>> To play with the number of mappers you can set a lower value, like 64 MB,
>> for the min and max split size:
>>
>> mapred.min.split.size and mapred.max.split.size
>>
>> Regards
>> Bejoy KS
>>
>> Sent from remote device, Please excuse typos
>> ------------------------------
>> *From: * Souvik Banerjee <souvikbaner...@gmail.com>
>> *Date: *Thu, 13 Dec 2012 12:00:16 -0600
>> *To: *<user@hive.apache.org>; <bejoy...@yahoo.com>
>> *Subject: *Re: Map side join
>>
>> Hi Bejoy,
>>
>> The input files are uncompressed text files.
>> There are enough free slots in the cluster.
>>
>> Can you please let me know how I can increase the number of mappers?
>> I tried reducing the HDFS block size from 128 MB to 32 MB. I was
>> expecting to get more mappers, but it still launches the same number of
>> mappers as it did when the HDFS block size was 128 MB. I have enough map
>> slots available, but I am not able to utilize them.
>>
>>
>> Thanks and regards,
>> Souvik.
>>
>>
>> On Thu, Dec 13, 2012 at 11:12 AM, <bejoy...@yahoo.com> wrote:
>>
>>> Hi Souvik
>>>
>>> Are your input files compressed using some non-splittable compression
>>> codec?
>>>
>>> Do you have enough free slots while this job is running?
>>>
>>> Make sure that the job is not running locally.
>>>
>>> Regards
>>> Bejoy KS
>>>
>>> Sent from remote device, Please excuse typos
>>> ------------------------------
>>> *From: * Souvik Banerjee <souvikbaner...@gmail.com>
>>> *Date: *Wed, 12 Dec 2012 14:27:27 -0600
>>> *To: *<user@hive.apache.org>; <bejoy...@yahoo.com>
>>> *ReplyTo: * user@hive.apache.org
>>> *Subject: *Re: Map side join
>>>
>>> Hi Bejoy,
>>>
>>> Yes I ran the pi example. It was fine.
>>> Regarding the HIVE job, what I found is that it took 4 hrs for the first
>>> map job to complete.
>>> Those map tasks were doing their job and only reported status after
>>> completion. It is indeed taking too long to finish. I could not find
>>> anything relevant in the logs.
>>>
>>> Thanks and regards,
>>> Souvik.
>>>
>>> On Wed, Dec 12, 2012 at 8:04 AM, <bejoy...@yahoo.com> wrote:
>>>
>>>> Hi Souvik
>>>>
>>>> Apart from Hive jobs, are normal MapReduce jobs like wordcount
>>>> running fine on your cluster?
>>>>
>>>> If they are working, are you seeing anything suspicious for the Hive
>>>> jobs in the task, TaskTracker or JobTracker logs?
>>>>
>>>>
>>>> Regards
>>>> Bejoy KS
>>>>
>>>> Sent from remote device, Please excuse typos
>>>> ------------------------------
>>>> *From: * Souvik Banerjee <souvikbaner...@gmail.com>
>>>> *Date: *Tue, 11 Dec 2012 17:12:20 -0600
>>>> *To: *<user@hive.apache.org>; <bejoy...@yahoo.com>
>>>> *ReplyTo: * user@hive.apache.org
>>>> *Subject: *Re: Map side join
>>>>
>>>> Hello Everybody,
>>>>
>>>> I need help with a HIVE join. As we were talking about the map-side
>>>> join, I tried that.
>>>> I set the flag: set hive.auto.convert.join=true;
>>>>
>>>> I saw Hive convert the join to a map join while launching the job. But
>>>> the problem is that none of the map tasks progress in my case. I made
>>>> the dataset smaller. Now it's only 512 MB cross 25 MB. I was expecting
>>>> it to be done very quickly.
>>>> No luck with any change of settings.
>>>> When it failed to progress with the default settings, I changed these
>>>> settings:
>>>> set hive.mapred.local.mem=1024; // Initially it was 216 I guess
>>>> set hive.join.cache.size=100000; // Initially it was 25000
>>>>
>>>> Also, on the Hadoop side I made this change:
>>>>
>>>> mapred.child.java.opts -Xmx1073741824
>>>>
>>>> But I don't see any progress. After more than 40 minutes of running I
>>>> am still at 0% map completion.
>>>> Can you please shed some light on this?
>>>>
>>>> Thanks a lot once again.
>>>>
>>>> Regards,
>>>> Souvik.
>>>>
>>>>
>>>>
>>>> On Fri, Dec 7, 2012 at 2:32 PM, Souvik Banerjee <
>>>> souvikbaner...@gmail.com> wrote:
>>>>
>>>>> Hi Bejoy,
>>>>>
>>>>> That's wonderful. Thanks for your reply.
>>>>> What I was wondering is whether HIVE can do a map-side join with more
>>>>> than one condition in the JOIN clause.
>>>>> I'll simply try it out and post the result.
>>>>>
>>>>> Thanks once again.
>>>>>
>>>>> Regards,
>>>>> Souvik.
>>>>>
>>>>>  On Fri, Dec 7, 2012 at 2:10 PM, <bejoy...@yahoo.com> wrote:
>>>>>
>>>>>> Hi Souvik
>>>>>>
>>>>>> In earlier versions of Hive you had to give the map join hint. But in
>>>>>> later versions you just set hive.auto.convert.join = true;
>>>>>> Hive automatically selects the smaller table. It is better to give
>>>>>> the smaller table as the first one in the join.
>>>>>>
>>>>>> You can use a map join if you are joining a small table with a large
>>>>>> one, in terms of data size. By small, it is better to have the smaller
>>>>>> table's size in the range of MBs.
>>>>>> Regards
>>>>>> Bejoy KS
>>>>>>
>>>>>> Sent from remote device, Please excuse typos
>>>>>> ------------------------------
>>>>>> *From: *Souvik Banerjee <souvikbaner...@gmail.com>
>>>>>> *Date: *Fri, 7 Dec 2012 13:58:25 -0600
>>>>>> *To: *<user@hive.apache.org>
>>>>>> *ReplyTo: *user@hive.apache.org
>>>>>> *Subject: *Map side join
>>>>>>
>>>>>> Hello everybody,
>>>>>>
>>>>>> I have got a question. I didn't come across any post which says
>>>>>> something about this.
>>>>>> I have got two tables. Let's say A and B.
>>>>>> I want to join A & B in HIVE. I am currently using HIVE version 0.9.
>>>>>> The join would be on a few columns, like on (A.id1 = B.id1) AND (A.id2
>>>>>> = B.id2) AND (A.id3 = B.id3)
>>>>>>
>>>>>> Can I ask HIVE to use a map-side join in this scenario? Should I give
>>>>>> a hint to HIVE by saying /*+mapjoin(B)*/?
>>>>>>
>>>>>> Get back to me if you want any more information in this regard.
>>>>>>
>>>>>> Thanks and regards,
>>>>>> Souvik.
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>
