[ 
https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16125155#comment-16125155
 ] 

liyunzhang_intel commented on HIVE-17287:
-----------------------------------------

[~lirui]:
bq.Have you tried hive.spark.use.groupby.shuffle? I think it can avoid 
unbounded mem usage.
  I have not enabled {{hive.spark.use.groupby.shuffle}} in my cluster; I will 
try this configuration later. But why does HiveConf say "Spark groupByKey 
transformation has better performance but uses unbounded memory"? Does that 
mean enabling it will use unbounded memory?
bq.For the error you mentioned, I usually disable 
yarn.nodemanager.pmem-check-enabled as a workaround.
I have disabled this configuration in my cluster, but the error still occurred.
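As a rough sketch of how the settings discussed above might be applied (the 
values are assumptions to verify against your Hive version, not a confirmed 
fix): going by the HiveConf text quoted above, setting 
{{hive.spark.use.groupby.shuffle}} to false appears to be what avoids the 
groupByKey path, and {{yarn.nodemanager.pmem-check-enabled}} is a NodeManager 
property in yarn-site.xml rather than a Hive session setting.
{code}
-- Assumed: false avoids the groupByKey-based shuffle and its unbounded memory use
set hive.spark.use.groupby.shuffle=false;
-- Ask Hive to plan the group by for skewed data
set hive.groupby.skewindata=true;
-- yarn.nodemanager.pmem-check-enabled=false belongs in yarn-site.xml on the NodeManagers
{code}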


> HoS can not deal with skewed data group by
> ------------------------------------------
>
>                 Key: HIVE-17287
>                 URL: https://issues.apache.org/jira/browse/HIVE-17287
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: query67-fail-at-groupby.png, 
> query67-groupby_shuffle_metric.png
>
>
> In 
> [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
>  the fact table {{store_sales}} joins with the small tables {{date_dim}}, 
> {{item}}, and {{store}}. After the join, the intermediate data is grouped.
> The {{store_sales}} data in the 3TB TPC-DS dataset is skewed: there are 1824 
> partitions; the biggest partition is 25.7G, while the others are about 715M each.
> {code}
> hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed table {{store_sales}} caused the job to fail. Is there any way to 
> solve the group-by problem for a skewed table? I tried enabling 
> {{hive.groupby.skewindata}} to distribute the data more evenly before doing 
> the group by, but the job still hangs.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
