[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122929#comment-16122929 ]
Rui Li commented on HIVE-17287:
-------------------------------

Hi [~kellyzly], I'm trying to understand how the group by is skewed. If you do the group by after the map join, the 11 map-join tasks output data that gets shuffled again. Only 6 of those tasks have data to output; the other 5 output nothing because the map join generates no records for them. However, this doesn't mean the following shuffle is necessarily skewed. E.g. if you have 100 downstream tasks, they can all fetch from the 6 upstream tasks, as long as the grouping key is evenly distributed. So have you verified whether the group key is skewed? It would be strange if the key is not skewed but the shuffle is. One possible reason is that Spark can give reduce tasks a locality preference, which may affect the case you described; you can try setting {{spark.shuffle.reduceLocality.enabled=false}} to disable it.

> HoS can not deal with skewed data group by
> ------------------------------------------
>
>                 Key: HIVE-17287
>                 URL: https://issues.apache.org/jira/browse/HIVE-17287
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>
> In [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], the fact table {{store_sales}} joins with the small tables {{date_dim}}, {{item}}, and {{store}}. After the join, the intermediate data is grouped by. The data of {{store_sales}} on 3TB TPC-DS is skewed: there are 1824 partitions; the biggest partition is 25.7G while the others are around 715M.
> {code}
> hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed table {{store_sales}} causes the job to fail. Is there any way to solve the group-by problem on a skewed table? I tried enabling {{hive.groupby.skewindata}} to first distribute the data more evenly and then do the group by, but the job still hangs.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
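To check whether the grouping key itself is skewed, as the comment asks, a quick probe can be run before the full query. This is only a sketch: {{group_col}} is a placeholder, not an actual column of query67; substitute the real grouping columns from the query.

{code}
-- Sketch: count rows per grouping key and show the heaviest keys.
-- 'group_col' is a hypothetical placeholder for query67's real grouping columns.
SELECT group_col, COUNT(*) AS cnt
FROM store_sales
GROUP BY group_col
ORDER BY cnt DESC
LIMIT 20;
{code}

If a few keys hold a disproportionate share of the rows, the shuffle after the map join will be skewed no matter how many upstream tasks produce output; if the counts are roughly even, disabling reduce-task locality ({{spark.shuffle.reduceLocality.enabled=false}}) is worth trying first.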