Hi Gopal, Thanks for the speedy response! A follow-up question though: 10Mb input sounds like that would work for a map join. I’m having trouble doing a cross join between two tables that are too big for a map-side join. Trying to break down one table into small enough partitions and then unioning them together seems to give comparable performance to a cross join. I’m running Hive on Map Reduce right now. Short of moving to a different execution engine, are there any performance improvements that can be made to lessen the pain of a cross join? Also, could you please elaborate on your comment “The real trouble is that MapReduce cannot re-direct data at all (there’s only shuffle edges)"? Thanks!
Best, Rory On 11/6/15, 5:09 PM, "Gopal Vijayaraghavan" <go...@hortonworks.com on behalf of gop...@apache.org> wrote: > >> Over the last few week I¹ve been trying to use cross joins/cartesian >>products and was wondering why, exactly, this all gets sent to one >>reducer. All I¹ve heard or read is that Hive can¹t/doesn¹t parallelize >>the job. > >The hashcode of the shuffle key is 0, since you need to process every row >against every key - there's no possibility of dividing up the work. > >Tez will actually have a cross-product edge (TEZ-2104), which is a >distributed cross-product proposal but wasn't picked up in the last Google >Summer of Code. > >The real trouble is that MapReduce cannot re-direct data at all (there's >only shuffle edges). > >> Does anyone have a workaround? > >I use a staged partitioned table as a workaround for this, hashed on a >high nDV key - the goal of the Tez edge is to shuffle the data similarly >at runtime. > >For instance, this python script makes a query with a 19x improvement in >distribution for a cross-product which generates 50+Gb of data from a >~10Mb input. > >https://gist.github.com/t3rmin4t0r/cfb5bb4f7094d595c1e8 > > >It is possible for Hive-Tez to actually generate UNION VertexGroups, but >it's much more efficient to do this as a edge with a custom EdgeManager, >since that opens up potentially implementing ThetaJoins in hive using that. > >Cheers, >Gopal > >