Hi Mark, I'm not sure if I understand what your trying to do correctly, do you know the reference id on which you want to do the join beforehand? Or is one of your tables small? Or are they all big with a small intersection?
I you haven't yet, I would suggest you to have a look at MapJoin: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Joins Mapjoins allow hive to perform a join on the map side by putting the small table in cache. (depending on your hive version, this might or might not be enabled by default) Depending on your use case, you might have to pre-filter some tables, for that you can use sub-queries like: FROM ( SELECT id FROM T WHERE id = myID ) T1 JOIN T2 ON T1.id=T2.id SELECT * Unlike a standard sql that would simplifies the query, I believe this will force Hive to perform the sub-select with a first mapreduce, and apply the join in the second (as a mapjoin if the result of the subquery is small enough). Hope this helps, Furcy 2014-06-09 18:48 GMT+02:00 Mark Desnoyer <desno...@neon-lab.com>: > Hi, > > I was wondering if there was a way in Hive to trigger it to perform an > efficient equality join on large tables? Specifically, I have two or more > tables where the joined key is relatively rare in each table. A good > example would be an AdClick scenario where you would have two tables, one > for ad loads and one for ad clicks, where there is a reference id to > connect a click to a specific ad load and that's the key to join on. > > In the MapReduce framework, this join could be done efficiently by using > the reference id as the key from the map output and then the join would be > done in the reducer since there are a small number of rows associated with > each reference id. However, hive implements join by buffering all but the > last tables in the reducer, and then streams the last table through. This > works great if the first tables are relatively small, but if the tables are > large and the associated rows are rare, it's very inefficient. > > So, can hive perform the join the first way I described? If so, any idea > how I trigger it? Or do I have to write my own MR job? > > Cheers, > Mark >