If you have a query producing that many partitions, it is probably a bad idea. Consider using Hive's bucketing or changing your partitioning scheme.
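As a rough illustration (the table, column names, and bucket count below are all invented), bucketing spreads a high-cardinality column across a fixed number of files instead of creating one partition directory per value:

  -- Sketch only: table, columns, and bucket count are made up for illustration.
  CREATE TABLE events_bucketed (
    event_id BIGINT,
    user_id  BIGINT,
    payload  STRING
  )
  PARTITIONED BY (event_date STRING)
  CLUSTERED BY (user_id) INTO 32 BUCKETS;

  -- Have Hive write one file per bucket when populating the table.
  SET hive.enforce.bucketing = true;

  INSERT OVERWRITE TABLE events_bucketed PARTITION (event_date='2012-06-20')
  SELECT event_id, user_id, payload
  FROM events
  WHERE event_date = '2012-06-20';

Each partition then holds a fixed number of bucket files rather than one sub-directory per distinct value, which keeps mapred.input.dir small as the data grows.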
Edward

On Wed, Jun 20, 2012 at 12:52 PM, Greg Fuller <greg.ful...@workday.com> wrote:
> Hi,
>
> I sent this out to common-u...@hadoop.apache.org yesterday, but the hive
> mailing list might be a better forum for this question.
>
> With CDH3u4 and Cloudera Manager, I am running a hive query to repartition
> all of our tables. I'm reducing the number of partitions from 5 to 2,
> because the performance benefit of a smaller mapred.input.dir is
> significant, which I only realized as our tables grew in size, and there
> was little perceived benefit from the extra partitions given our typical
> queries. After adjusting hive.exec.max.dynamic.partitions to deal with the
> enormous number of partitions in our larger tables, I got this exception
> when running the conversion query:
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException:
> java.io.IOException: Exceeded max jobconf size: 5445900 limit: 5242880
>     at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3771)
>     at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     …
>
> I worked around this problem in a roundabout way and the query succeeded,
> but I don't understand why, and I would like a better grasp of what is
> going on. I tried these things:
>
> 1) In my hive query file, I added "set mapred.user.jobconf.limit=7000000;"
> before the query, but I saw the exact same exception.
>
> 2) Since setting mapred.user.jobconf.limit from the CLI didn't seem to be
> working, I used the safety valve for the jobtracker via Cloudera Manager
> to add this:
>
> <property>
>   <name>mapred.user.jobconf.limit</name>
>   <value>7000000</value>
> </property>
>
> Then I saved those changes, restarted the jobtracker, and reran the query.
> I saw the same exception.
>
> Digging further, I used "set -v" in my hive query file to see the value of
> mapred.user.jobconf.limit, and I discovered:
>
> a) hive -e "set mapred.user.jobconf.limit=7000000; set -v" | grep
> mapred.user.jobconf.limit showed the value as 7000000, so it seems as if
> the CLI setting is being observed.
>
> b) hive -e "set -v" | grep mapred.user.jobconf.limit showed the value as
> 5242880, which suggests that the safety valve isn't working (?).
>
> 3) Finally, I wondered if there was a hard-coded 5 MB maximum for
> mapred.user.jobconf.limit, even though I looked at the code and saw nothing
> obvious. So I tried "set mapred.user.jobconf.limit=100" to set it to a very
> small value, expecting the exception to report that I had exceeded a limit
> of 100. Guess what? The query executed successfully, which makes absolutely
> no sense to me.
>
> FYI, the size in bytes of mapred.input.dir for this query was 5392189.
>
> Does anyone know why:
>
> 1) The safety valve setting wasn't observed,
> 2) The CLI setting, which did seem to be observed, was not used, at least
> according to the limit stated in the exception, and
> 3) Setting mapred.user.jobconf.limit to an absurdly low number actually
> allowed the query to succeed?
>
> Thanks,
> Greg
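For reference, the repartitioning step described above is typically expressed as a dynamic-partition insert along these lines (a sketch only; the table and column names are invented, and the limit values are placeholders that would need tuning for the actual partition count):

  -- Sketch: rewrite a table partitioned on many keys into one partitioned on two.
  SET hive.exec.dynamic.partition = true;
  SET hive.exec.dynamic.partition.mode = nonstrict;
  -- Limits that usually need raising when a table has a very large partition count.
  SET hive.exec.max.dynamic.partitions = 100000;
  SET hive.exec.max.dynamic.partitions.pernode = 10000;

  -- The dynamic partition columns must come last in the SELECT list,
  -- in the same order as the PARTITION clause.
  INSERT OVERWRITE TABLE events_two_keys PARTITION (region, event_date)
  SELECT event_id, user_id, payload, region, event_date
  FROM events_five_keys;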