3 node cluster with 15 gigs of RAM per node. Two tables L is approximately 1 Million rows, U is 100 Million. They both have latitude and longitude columns. I want to find the count of rows in U that are within a 10 mile radius of each of the row in L.
I have indexed the latitude and longitude columns in U. U is date wise partitioned. U and L are both stored in ORC Snappy file format. My query is like this: select l.id, count(u.id) from L l, U u where u.lat !=0 and u.lat > l.lat - 10/69 and u.lat < l.lat + 10/69 and u.lon > l.lon - ( 10 / ( 69 * cos(radians(l.lat)) ) ) and v.lon < l.lon + ( 10 / ( 69 * cos(radians(l.lat)) ) ) and 3960 *acos(cos(radians(l.lat)) * cos(radians(u.lat)) * cos(radians(l.lon) - radians(u.lon)) + sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0 group by l.id; The conditions in the where part enforce a bounding box filtering constraint based on lat/long values. The problem is that this results in 9 mappers but only 1 reducer. I notice that the job gets stuck at the 67% of the reduce phase. When I run htop I find that 2 of the nodes are sitting idle while the third node is busy running the single reduce task. I tried using "set mapreduce.job.reduces=50;" but that did not help as the number of reduce jobs was deduced to be 1 during compile time. How do I force more reducers?
