agreed!

On Jan 29, 2015, at 11:42 PM, matshyeq <matsh...@gmail.com> wrote:

no confusion here.
My use case is exactly the same.
1. What I was saying is my/your join condition looks like (or should look like, 
in your terms):

FROM A JOIN B
ON A.X = B.X
AND A.Y = B.Y

which, in my opinion, should trigger a sort merge bucket map join:
The data locality information is complete. You may look at the partitioning 
here as just another bucketing level: data should be joined within the SAME 
partitions and the SAME buckets, 1:1!
Apparently the Hive optimizer is not (yet?) considering partitioning for such 
an optimization.
To me it should, especially for cases where no bucketing is done on the tables 
and the partitioning columns are used in the join from both sides 
(FROM A JOIN B ON A.X = B.X).

2. If your query join is only based on a bucketing condition:
FROM A JOIN B
ON A.Y = B.Y

then the mappers wouldn't know which partition to join the data from a 
particular bucket with. They could still potentially look for the SAME bucket 
files in ALL available partitions, but it's not a 1:1 relation anymore, so you 
probably wouldn't gain that much from such an optimization. Anyway, that 
optimization doesn't seem to be there either.
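
One way to check which of the two join shapes above actually gets converted is 
to look at the plan with EXPLAIN. This is only a sketch: it assumes the tables 
A and B with columns X and Y as described in this thread, and the select list 
is arbitrary. As far as I know, a converted plan shows a "Sorted Merge Bucket 
Map Join Operator" in a map-only stage, while an unconverted one shows an 
ordinary join with a reduce phase; the exact wording may differ between Hive 
versions and execution engines.

```sql
-- Case 1: join on the partition column X and the bucketing column Y.
EXPLAIN
SELECT A.X, A.Y
FROM A JOIN B
  ON A.X = B.X
 AND A.Y = B.Y;

-- Case 2: join on the bucketing column Y only.
EXPLAIN
SELECT A.X, A.Y
FROM A JOIN B
  ON A.Y = B.Y;
```

Comparing the two plans side by side should show whether adding the partition 
column to the join condition changes the optimizer's choice.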

This thread is only to get a confirmation about the above (or an idea what I 
am/we are doing wrong)

~Maciek

On Thu, Jan 29, 2015 at 5:46 PM, murali parimi 
<muralikrishna.par...@icloud.com> wrote:
Hello, I apologize for the confusion. Let me state the problem again.

I have two tables, A and B, which are partitioned on column X and bucketed 
(same number of buckets) on column Y. Table A is huge (~135GB) and table B is 
smaller (33GB). Both tables have around 3.1 billion records. The storage 
format is ORC.

I intended to do a sort merge bucket map join, hoping that no reducers would 
be spawned and the join would happen on the map side. I used the following 
settings.

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.enforce.sorting=true;
 
Hive version 0.13.
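
For reference, a minimal sketch of the table layout described above. The 
column types, the extra payload column and the bucket count of 32 are 
assumptions for illustration, not the real definitions. Two things it is meant 
to highlight: the sort order has to be declared on the table with SORTED BY, 
and hive.enforce.bucketing / hive.enforce.sorting only influence how data is 
written when the tables are populated, not how an existing table is read. 
hive.auto.convert.sortmerge.join is a further setting that may be worth trying; 
whether it changes the outcome here is untested.

```sql
-- Hypothetical layout: partitioned on X, bucketed and sorted on Y.
CREATE TABLE A (
  Y       BIGINT,   -- bucketing and sort column (type assumed)
  payload STRING    -- placeholder for the remaining columns
)
PARTITIONED BY (X STRING)
CLUSTERED BY (Y) SORTED BY (Y ASC) INTO 32 BUCKETS
STORED AS ORC;

-- Table B would be declared the same way, with the same bucket count.

-- Settings that matter while loading data into A and B (Hive 0.x / 1.x).
set hive.enforce.bucketing=true;
set hive.enforce.sorting=true;

-- Additional setting to try at query time (assumption: not yet tested here).
set hive.auto.convert.sortmerge.join=true;
```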

Any thoughts?

Thanks,
Murali


On Jan 29, 2015, at 07:44 PM, matshyeq <matsh...@gmail.com> wrote:

My hunch is that, while partitioning is in fact very similar to bucketing 
(actually superior, as you have some control over which file the data goes 
to), the Hive optimizer only applies bucket joins if your tables are bucketed, 
so the join condition
   t1.bucketed_column = t2.bucketed_column
triggers the bucketed map join
but
   t1.partitioned_column = t2.partitioned_column
doesn't.
I'm hoping someone with deeper Hive knowledge will be able to confirm this.

Thank you,
Kind Regards 
~Maciek

On Thu, Jan 29, 2015 at 1:51 PM, murali parimi 
<muralikrishna.par...@icloud.com> wrote:
I faced the same situation: two tables with 3 billion records on each side, 
partitioned and sorted on the same key. I set the following parameters in the 
Hive query, assuming the join would happen in the map phase.

set hive.input.format=org.apache.hadoop.hive.ql.io.BucketizedHiveInputFormat;
set hive.optimize.bucketmapjoin=true;
set hive.optimize.bucketmapjoin.sortedmerge=true;
set hive.enforce.sorting=true;

I am using Hive version 0.13 and the storage format is ORC. One of the tables 
is small in size, but I haven't checked whether it can fit in the cache, as we 
have huge memory. But the map-side join didn't happen. What could be the 
reason?

On Jan 29, 2015, at 7:38 AM, matshyeq <matsh...@gmail.com> wrote:

I do have two tables partitioned on the same criteria.
Could I still take advantage of Bucket Map Join or better, Sort Merge Bucket 
Map Join?
How?

~Maciek

