By decreasing mapreduce.reduce.shuffle.parallelcopies from 20 to 5, everything seems to go well now; no more OOM. ~~
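For reference, the change described above can be applied as a per-session override (a minimal sketch; the property is the stock Hadoop 2.x name, and the value 5 is simply the setting reported to work here, not a general recommendation):

```
-- Fewer parallel fetcher threads per reducer, so fewer in-memory
-- map outputs are reserved at once during the shuffle phase.
set mapreduce.reduce.shuffle.parallelcopies=5;
```

Each fetcher thread can hold a map output in memory while copying, so lowering the thread count directly reduces peak shuffle-buffer pressure at the cost of slower copy throughput.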
2017-08-23 17:19 GMT+08:00 panfei <cnwe...@gmail.com>:

> The full error stack (described here: https://issues.apache.org/jira/browse/MAPREDUCE-6108) is below.
>
> This error cannot be reproduced every time; after retrying several times, the job finished successfully.
>
> 2017-08-23 17:16:03,574 WARN [main] org.apache.hadoop.mapred.YarnChild: Exception running child :
> org.apache.hadoop.mapreduce.task.reduce.Shuffle$ShuffleError: error in shuffle in fetcher#2
>     at org.apache.hadoop.mapreduce.task.reduce.Shuffle.run(Shuffle.java:134)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:376)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: java.lang.OutOfMemoryError: Java heap space
>     at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:56)
>     at org.apache.hadoop.io.BoundedByteArrayOutputStream.<init>(BoundedByteArrayOutputStream.java:46)
>     at org.apache.hadoop.mapreduce.task.reduce.InMemoryMapOutput.<init>(InMemoryMapOutput.java:63)
>     at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.unconditionalReserve(MergeManagerImpl.java:305)
>     at org.apache.hadoop.mapreduce.task.reduce.MergeManagerImpl.reserve(MergeManagerImpl.java:295)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyMapOutput(Fetcher.java:514)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.copyFromHost(Fetcher.java:336)
>     at org.apache.hadoop.mapreduce.task.reduce.Fetcher.run(Fetcher.java:193)
>
> 2017-08-23 17:16:03,577 INFO [main] org.apache.hadoop.mapred.Task: Runnning cleanup for the task
>
> 2017-08-23 13:10 GMT+08:00 panfei <cnwe...@gmail.com>:
>
>> Hi Gopal, thanks for all the information and suggestions.
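The allocation failing above (in MergeManagerImpl.reserve via InMemoryMapOutput) is bounded by the reducer's shuffle buffer settings, so besides lowering parallelcopies these knobs are commonly tuned for this OOM. A hedged sketch; the property names are stock Hadoop 2.x, but the defaults quoted and the lowered values are assumptions to adapt per cluster:

```
-- Fraction of the reducer heap usable for holding map outputs
-- in memory during shuffle (stock default is 0.70):
set mapreduce.reduce.shuffle.input.buffer.percent=0.5;

-- Maximum fraction of that buffer a single map output may occupy
-- before it is fetched straight to disk instead (default 0.25):
set mapreduce.reduce.shuffle.memory.limit.percent=0.15;
```

Lower values trade shuffle speed (more spills to disk) for headroom on the reducer heap, which is the same trade-off the parallelcopies change makes.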
>> The Hive version is 2.0.1, using Hive-on-MR as the execution engine.
>>
>> I think I should create an intermediate table which includes all the dimensions (including the several kinds of ids), and then use spark-sql to calculate the distinct values separately (Spark SQL is really fast, so ~~).
>>
>> Thanks again.
>>
>> 2017-08-23 12:56 GMT+08:00 Gopal Vijayaraghavan <gop...@apache.org>:
>>
>>> > COUNT(DISTINCT monthly_user_id) AS monthly_active_users,
>>> > COUNT(DISTINCT weekly_user_id) AS weekly_active_users,
>>> …
>>> > GROUPING_ID() AS gid,
>>> > COUNT(1) AS dummy
>>>
>>> There are two things which prevent Hive from optimizing multiple count distincts: another aggregate like a COUNT(1), or a grouping set like a ROLLUP/CUBE.
>>>
>>> The multiple count distincts are rewritten into a ROLLUP internally by the CBO:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-10901
>>>
>>> A single count distinct + other aggregates (like min, max, count, count_distinct in one pass) is fixed via:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-16654
>>>
>>> There's no optimizer rule to combine both those scenarios:
>>>
>>> https://issues.apache.org/jira/browse/HIVE-15045
>>>
>>> There's a possibility that you're using the Hive-1.x release branch, where the CBO doesn't kick in unless column stats are present; in the Hive-2.x series you'll notice that some of these optimizations are not driven by a cost function and are always applied if CBO is enabled.
>>>
>>> > is there any way to rewrite it to optimize the memory usage.
>>>
>>> If you want it to run through very slowly without errors, you can try disabling all in-memory aggregations:
>>>
>>> set hive.map.aggr=false;
>>>
>>> Cheers,
>>> Gopal
>>
>> --
>> 不学习,不知道 ("If you don't study, you won't know")
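The intermediate-table plan described in the thread can be sketched roughly as follows. The staging table name and the source table name are hypothetical; only the id column names come from the quoted query, and the real query's other dimensions and grouping are elided:

```sql
-- Hypothetical staging step: materialize just the id columns once.
CREATE TABLE active_user_ids_stage AS
SELECT monthly_user_id, weekly_user_id
FROM   source_events;   -- source table name is an assumption

-- Each DISTINCT then runs as its own single-distinct query, which
-- avoids the multiple-count-distinct + GROUPING_ID combination that
-- the optimizer cannot rewrite (per HIVE-15045):
SELECT COUNT(DISTINCT monthly_user_id) AS monthly_active_users
FROM   active_user_ids_stage;

SELECT COUNT(DISTINCT weekly_user_id)  AS weekly_active_users
FROM   active_user_ids_stage;
```

The same staged queries could equally be run through Spark SQL, as proposed above; the point is that each pass contains at most one DISTINCT aggregate.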