[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16122897#comment-16122897 ]
liyunzhang_intel commented on HIVE-17287:
-----------------------------------------

[~gopalv], [~lirui]: the reason the join output is skewed is that I convert all joins to map joins. In the following query, the fact table is store_sales and the dimension tables are date_dim, store and item. The total size of date_dim, store and item is smaller than {{hive.auto.convert.join.noconditionaltask.size}}, so Hive starts 11 map works to read store_sales and do the map join. A map work may produce no records at all if none of the store_sales rows it reads match the dimension tables.
{code}
select i_category
  ,i_class
  ,i_brand
  ,i_product_name
  ,d_year
  ,d_qoy
  ,d_moy
  ,s_store_id
  ,store_sales.ss_sold_date_sk
  ,store_sales.ss_item_sk
  ,store_sales.ss_store_sk
from store_sales
  ,date_dim
  ,store
  ,item
where store_sales.ss_sold_date_sk=date_dim.d_date_sk
  and store_sales.ss_item_sk=item.i_item_sk
  and store_sales.ss_store_sk = store.s_store_sk
  and d_month_seq between 1193 and 1193+11;
{code}
It is reasonable that the output of the map join is uneven, but is there any way to make it even? Otherwise, when a group by follows the map join, the data assigned to the group by tasks is uneven as well.

> HoS can not deal with skewed data group by
> ------------------------------------------
>
>          Key: HIVE-17287
>          URL: https://issues.apache.org/jira/browse/HIVE-17287
>      Project: Hive
>   Issue Type: Bug
>     Reporter: liyunzhang_intel
>     Assignee: liyunzhang_intel
>
> In [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql],
> the fact table {{store_sales}} joins with the small tables {{date_dim}},
> {{item}} and {{store}}. After the join, the intermediate data is grouped.
> Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824
> partitions. The biggest partition is 25.7G and the others are about 715M.
> {code}
> hadoop fs -du -h /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales
> ....
> 715.0 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639
> 713.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640
> 714.1 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641
> 712.9 M  /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642
> 25.7 G   /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__
> {code}
> The skewed table {{store_sales}} caused the job to fail. Is there any way to
> solve the group by problem for a skewed table? I tried enabling
> {{hive.groupby.skewindata}} to first distribute the data more evenly and then
> do the group by, but the job still hangs.
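
For reference, the two-stage idea behind {{hive.groupby.skewindata}} (spread rows randomly for a partial aggregation, then aggregate by the real key) can also be written by hand. This is only a minimal sketch under assumptions: {{joined}} is a hypothetical table or view standing for the map join output, the two grouping columns and {{count(*)}} are placeholders and not the actual query67 aggregation, and 16 salt buckets is an arbitrary choice. It has not been run against this data set, so whether it avoids the hang here is untested.
{code}
-- Stage 1: partial aggregation on (key, salt), so one hot key is spread over up to 16 groups.
-- Stage 2: final aggregation on the real key over the much smaller partial results.
-- "joined" is hypothetical; replace it with the map join output.
select i_category, d_year, sum(partial_cnt) as cnt
from (
  select i_category, d_year, salt, count(*) as partial_cnt
  from (
    select i_category, d_year, cast(floor(rand() * 16) as int) as salt
    from joined
  ) salted
  group by i_category, d_year, salt
) stage1
group by i_category, d_year;
{code}
The rewrite is only valid for aggregates that can be combined from partial results (count, sum, min, max); something like count(distinct ...) would need a different approach.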