By decreasing mapreduce.reduce.shuffle.parallelcopies from 20 to 5, everything seems to go well now; no more OOM. ~~
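For reference, the change described above can be applied as a per-session override (a minimal sketch; the property is the stock Hadoop 2.x name, and the value 5 is simply the setting reported to work here, not a general recommendation):

```
-- Fewer parallel fetcher threads per reducer, so fewer in-memory
-- map outputs are reserved at once during the shuffle phase.
set mapreduce.reduce.shuffle.parallelcopies=5;
```

Each fetcher thread can hold a map output in memory while copying, so lowering the thread count directly reduces peak shuffle-buffer pressure at the cost of slower copy throughput.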
2017-08-23 17:19 GMT+08:00 panfei <cnwe...@gmail.com>:

> The full error stack (described here: https://issues.apache.org/jira/browse/MAPREDUCE-6108) is below.
>
> This error cannot be reproduced every time; after retrying several times, the job finished successfully.
>
> 2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child :
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
>     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
>     at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
>     at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
>     at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
>     at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
>
> 2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
> 2017-08-23 13:10 GMT+08:00 panfei <cnwe...@gmail.com>:
>
>> Hi Gopal, thanks for all the information and suggestions.
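The allocation failing above (in MergeManagerImpl.reserve via InMemoryMapOutput) is bounded by the reducer's shuffle buffer settings, so besides lowering parallelcopies these knobs are commonly tuned for this OOM. A hedged sketch; the property names are stock Hadoop 2.x, but the defaults quoted and the lowered values are assumptions to adapt per cluster:

```
-- Fraction of the reducer heap usable for holding map outputs
-- in memory during shuffle (stock default is 0.70):
set mapreduce.reduce.shuffle.input.buffer.percent=0.5;

-- Maximum fraction of that buffer a single map output may occupy
-- before it is fetched straight to disk instead (default 0.25):
set mapreduce.reduce.shuffle.memory.limit.percent=0.15;
```

Lower values trade shuffle speed (more spills to disk) for headroom on the reducer heap, which is the same trade-off the parallelcopies change makes.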
>> The Hive version is 2.0.1, using Hive-on-MR as the execution engine.
>>
>> I think I should create an intermediate table which includes all the dimensions (including the several kinds of ids), and then use spark-sql to calculate the distinct values separately (Spark SQL is really fast, so ~~).
>>
>> Thanks again.
>>
>> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan <gop...@apache.org>:
>>
>>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>>> …
>>> > GROUPING_ID() AS gid,
>>> > COUNT(1) AS dummy
>>>
>>> There are two things which prevent Hive from optimizing multiple count distincts: another aggregate like a COUNT(1), or a grouping set like a ROLLUP/CUBE.
>>>
>>> The multiple count distincts are rewritten into a ROLLUP internally by the CBO:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-10901
>>>
>>> A single count distinct + other aggregates (like min, max, count, count_distinct in one pass) is fixed via:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-16654
>>>
>>> There's no optimizer rule to combine both those scenarios:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-15045
>>>
>>> There's a possibility that you're using the Hive-1.x release branch, where the CBO doesn't kick in unless column stats are present; in the Hive-2.x series you'll notice that some of these optimizations are not driven by a cost function and are always applied if CBO is enabled.
>>>
>>> > is there any way to rewrite it to optimize the memory usage.
>>>
>>> If you want it to run through very slowly without errors, you can try disabling all in-memory aggregations:
>>>
>>> set hive.map.aggr=false;
>>>
>>> Cheers,
>>> Gopal
>>
>> --
>> 不学习,不知道 ("If you don't study, you won't know")
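The intermediate-table plan described in the thread can be sketched roughly as follows. The staging table name and the source table name are hypothetical; only the id column names come from the quoted query, and the real query's other dimensions and grouping are elided:

```sql
-- Hypothetical staging step: materialize just the id columns once.
CREATE TABLE active_user_ids_stage AS
SELECT monthly_user_id, weekly_user_id
FROM   source_events;   -- source table name is an assumption

-- Each DISTINCT then runs as its own single-distinct query, which
-- avoids the multiple-count-distinct + GROUPING_ID combination that
-- the optimizer cannot rewrite (per HIVE-15045):
SELECT COUNT(DISTINCT monthly_user_id) AS monthly_active_users
FROM   active_user_ids_stage;

SELECT COUNT(DISTINCT weekly_user_id)  AS weekly_active_users
FROM   active_user_ids_stage;
```

The same staged queries could equally be run through Spark SQL, as proposed above; the point is that each pass contains at most one DISTINCT aggregate.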