Re: Cross join/cartesian product explanation

Rory Sawyer Mon, 09 Nov 2015 12:45:52 -0800

Hi Gopal,

Thanks for the speedy response! A follow-up question though: 10Mb input sounds 
like that would work for a map join. I’m having trouble doing a cross join 
between two tables that are too big for a map-side join. Trying to break down 
one table into small enough partitions and then unioning them together seems to 
give comparable performance to a cross join. I’m running Hive on Map Reduce 
right now. Short of moving to a different execution engine, are there any 
performance improvements that can be made to lessen the pain of a cross join? 
Also, could you please elaborate on your comment “The real trouble is that 
MapReduce cannot re-direct data at all (there’s only shuffle edges)"? Thanks!


Best,
Rory



On 11/6/15, 5:09 PM, "Gopal Vijayaraghavan" <go...@hortonworks.com on behalf of 
gop...@apache.org> wrote:

>
>> Over the last few week I¹ve been trying to use cross joins/cartesian
>>products and was wondering why, exactly, this all gets sent to one
>>reducer. All I¹ve heard or read is that Hive can¹t/doesn¹t parallelize
>>the job. 
>
>The hashcode of the shuffle key is 0, since you need to process every row
>against every key - there's no possibility of dividing up the work.
>
>Tez will actually have a cross-product edge (TEZ-2104), which is a
>distributed cross-product proposal but wasn't picked up in the last Google
>Summer of Code.
>
>The real trouble is that MapReduce cannot re-direct data at all (there's
>only shuffle edges).
>
>> Does anyone have a workaround?
>
>I use a staged partitioned table as a workaround for this, hashed on a
>high nDV key - the goal of the Tez edge is to shuffle the data similarly
>at runtime.
>
>For instance, this python script makes a query with a 19x improvement in
>distribution for a cross-product which generates 50+Gb of data from a
>~10Mb input.
>
>https://gist.github.com/t3rmin4t0r/cfb5bb4f7094d595c1e8
>
>
>It is possible for Hive-Tez to actually generate UNION VertexGroups, but
>it's much more efficient to do this as a edge with a custom EdgeManager,
>since that opens up potentially implementing ThetaJoins in hive using that.
>
>Cheers,
>Gopal
>
>

Re: Cross join/cartesian product explanation

Reply via email to