A query producing that many partitions is probably a bad idea.
Consider using Hive's bucketing or changing your partitioning scheme.
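For example, bucketing caps the number of files per partition at the
bucket count instead of creating one directory per key value. A rough
sketch (the table, columns, and bucket count here are made up, not a
recommendation for your schema):

  CREATE TABLE events_bucketed (
    event_id   BIGINT,
    event_time STRING,
    payload    STRING
  )
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (event_id) INTO 64 BUCKETS;

  -- bucketing is only honored on insert when this is enabled
  SET hive.enforce.bucketing = true;

  INSERT OVERWRITE TABLE events_bucketed PARTITION (dt='2012-06-20')
  SELECT event_id, event_time, payload
  FROM events_raw
  WHERE dt = '2012-06-20';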

Edward

On Wed, Jun 20, 2012 at 12:52 PM, Greg Fuller <greg.ful...@workday.com> wrote:
> Hi,
>
> I sent this out to common-u...@hadoop.apache.org yesterday, but the hive 
> mailing list might be a better forum for this question.
>
> With CDH3u4 and Cloudera Manager, I am running a hive query to 
> repartition all of our tables.  I'm reducing the number of partition 
> columns from 5 to 2, because the performance benefit of a smaller 
> mapred.input.dir is significant (something I only realized as our tables 
> grew in size), and the extra partition columns offered little benefit for 
> our typical queries.  After adjusting hive.exec.max.dynamic.partitions 
> to cope with the enormous number of partitions in our larger tables, I 
> got this exception when running the conversion query (the general shape 
> of the query is sketched after the stack trace):
>
> org.apache.hadoop.ipc.RemoteException: java.io.IOException: 
> java.io.IOException: Exceeded max jobconf size: 5445900 limit: 5242880
>       at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:3771)
>       at sun.reflect.GeneratedMethodAccessor17.invoke(Unknown Source)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>        …
>
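> The conversion was a dynamic-partition insert roughly of this shape (a 
> sketch only; the table names, columns, and limit values here are 
> illustrative):
>
>   SET hive.exec.dynamic.partition = true;
>   SET hive.exec.dynamic.partition.mode = nonstrict;
>   SET hive.exec.max.dynamic.partitions = 100000;
>   SET hive.exec.max.dynamic.partitions.pernode = 100000;
>
>   -- read the old 5-column partition layout, write the new 2-column one
>   INSERT OVERWRITE TABLE events_v2 PARTITION (dt, region)
>   SELECT event_id, event_time, payload, dt, region
>   FROM events_v1;
>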
> I solved this problem in a roundabout way and the query eventually 
> succeeded, but I don't understand why, and I would like to get a better 
> grasp on what is going on.  I tried these things:
>
> 1) In my hive query file, I added "set mapred.user.jobconf.limit=7000000;" 
> before the query, but I saw the exact same exception.
>
> 2) Since setting mapred.user.jobconf.limit from the CLI didn't seem to be 
> working, I used the JobTracker safety valve in Cloudera Manager to 
> add this:
>
> <property>
>   <name>mapred.user.jobconf.limit</name>
>   <value>7000000</value>
> </property>
>
> and then I saved those changes, restarted the job tracker, and reran the 
> query.  I saw the same exception.
>
> Digging further, I used "set -v" in my hive query file to see the value of 
> mapred.user.jobconf.limit, and I discovered:
>
> a) hive -e "set mapred.user.jobconf.limit=7000000; set -v" | grep 
> mapred.user.jobconf.limit showed the value as 7000000, so it seems as if the 
> CLI setting is being observed.
> b) hive -e "set -v" | grep mapred.user.jobconf.limit showed the value as 
> 5242880, which suggests that the safety valve isn't working (?).
>
> 3) Finally, I wondered whether there was a hard-coded 5 MB maximum for 
> mapred.user.jobconf.limit, even though I looked at the code and saw nothing 
> obvious.  So I tried "set mapred.user.jobconf.limit=100" to set it to a very 
> small value, expecting the exception to report the limit as '100'.  Guess 
> what?  The query executed successfully, which makes absolutely no sense 
> to me.
>
> FYI, the size in bytes of mapred.input.dir for this query was 5392189.
>
> Does anyone know why:
>
> 1) The safety valve setting wasn't observed,
> 2) The CLI setting, which seemed to be observed, was not used, at least 
> according to the limit reported in the exception, and
> 3) Setting mapred.user.jobconf.limit to an absurdly low number actually 
> allowed the query to succeed?
>
> Thanks,
> Greg
