Is this table bucketed? If so, please set the number of reducers (set mapreduce.job.reduces=bucket_size) to match to the table's bucket size.
On Thu, Dec 17, 2015 at 1:25 AM, Awhan Patnaik <[email protected]> wrote: > 3 node cluster with 15 gigs of RAM per node. Two tables L is approximately > 1 Million rows, U is 100 Million. They both have latitude and longitude > columns. I want to find the count of rows in U that are within a 10 mile > radius of each of the row in L. > > I have indexed the latitude and longitude columns in U. U is date wise > partitioned. U and L are both stored in ORC Snappy file format. > > My query is like this: > > select l.id, count(u.id) from L l, U u > where > u.lat !=0 and > u.lat > l.lat - 10/69 and u.lat < l.lat + 10/69 and > u.lon > l.lon - ( 10 / ( 69 * cos(radians(l.lat)) ) ) and > v.lon < l.lon + ( 10 / ( 69 * cos(radians(l.lat)) ) ) and > 3960 *acos(cos(radians(l.lat)) * cos(radians(u.lat)) * cos(radians(l.lon) > - radians(u.lon)) + sin(radians(l.lat)) * sin(radians(u.lat))) < 10.0 > group by l.id; > > The conditions in the where part enforce a bounding box filtering > constraint based on lat/long values. > > The problem is that this results in 9 mappers but only 1 reducer. I notice > that the job gets stuck at the 67% of the reduce phase. When I run htop I > find that 2 of the nodes are sitting idle while the third node is busy > running the single reduce task. > > I tried using "set mapreduce.job.reduces=50;" but that did not help as the > number of reduce jobs was deduced to be 1 during compile time. > > How do I force more reducers? >
