[ https://issues.apache.org/jira/browse/HIVE-17287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16121023#comment-16121023 ]
liyunzhang_intel commented on HIVE-17287: ----------------------------------------- [~gopalv]: thanks for your comments {quote} As little as I know about Hive-on-Spark, Query67 does not read any row from the default partition in MR or Tez. {quote} the default_partition stores the data(25.7G) which ss_sold_date_sk is null. In MR/Tez, these part of data will not be load because there is no match data in join? {quote} I suspect HoS is loading each partition as an independent RDD, which removes the effect of SemanticAnalyzer::genNotNullFilterForJoinSourcePlan()? {quote} I also see the filter to filter null data in the explain, although hos load partition as an independent RDD, i think the filter should work in theory. part of explain of query67 {code} Map 1 Map Operator Tree: TableScan alias: store_sales filterExpr: (ss_store_sk is not null and ss_item_sk is not null) (type: boolean) Statistics: Num rows: 8251124389 Data size: 181524736558 Basic stats: COMPLETE Column stats: NONE Filter Operator predicate: (ss_store_sk is not null and ss_item_sk is not null) (type: boolean) Statistics: Num rows: 8251124389 Data size: 181524736558 Basic stats: COMPLETE Column stats: NONE Select Operator expressions: ss_item_sk (type: bigint), ss_store_sk (type: bigint), ss_quantity (type: int), ss_sales_price (type: double), ss_sold_date_sk (type: bigint) outputColumnNames: _col0, _col1, _col2, _col3, _col4 Statistics: Num rows: 8251124389 Data size: 181524736558 Basic stats: COMPLETE Column stats: NONE Map Join Operator {code} > HoS can not deal with skewed data group by > ------------------------------------------ > > Key: HIVE-17287 > URL: https://issues.apache.org/jira/browse/HIVE-17287 > Project: Hive > Issue Type: Bug > Reporter: liyunzhang_intel > > In > [tpcds/query67.sql|https://github.com/kellyzly/hive-testbench/blob/hive14/sample-queries-tpcds/query67.sql], > fact table {{store_sales}} joins with small tables {{date_dim}}, > {{item}},{{store}}. After join, groupby the intermediate data. > Here the data of {{store_sales}} on 3TB tpcds is skewed: there are 1824 > partitions. The biggest partition is 25.7G and others are 715M. > {code} > hadoop fs -du -h > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales > .... > 715.0 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452639 > 713.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452640 > 714.1 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452641 > 712.9 M > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=2452642 > 25.7 G > /user/hive/warehouse/tpcds_bin_partitioned_parquet_3000.db/store_sales/ss_sold_date_sk=__HIVE_DEFAULT_PARTITION__ > {code} > The skewed table {{store_sales}} caused the failed job. Is there any way to > solve the groupby problem of skewed table? I tried to enable > {{hive.groupby.skewindata}} to first divide the data more evenly then start > do group by. But the job still hangs. -- This message was sent by Atlassian JIRA (v6.4.14#64029)