[ https://issues.apache.org/jira/browse/HIVE-13995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15383923#comment-15383923 ]
Hive QA commented on HIVE-13995: -------------------------------- Here are the results of testing the latest attachment: https://issues.apache.org/jira/secure/attachment/12818732/HIVE-13995.4.patch {color:red}ERROR:{color} -1 due to no test(s) being added or modified. {color:red}ERROR:{color} -1 due to 70 failed/errored test(s), 10336 tests executed *Failed tests:* {noformat} org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_acid_globallimit org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_annotate_stats_part org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_autoColumnStats_1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_autoColumnStats_2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_sortmerge_join_12 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_gby org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_join org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_limit org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_gby org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_join org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_join0 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_limit org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_semijoin org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_rp_views org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_semijoin org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_cbo_views org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_full org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_partial org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_extrapolate_part_stats_partial_ndv org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_cond_pushdown org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_join_parse org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_12 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_13 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_mapjoin org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mergejoin org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_special_character_in_tabnames_1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_list_bucket org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_only_null org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_subquery_multiinsert org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver_bucket_map_join_tez1 org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver_bucket_map_join_tez2 org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver_tez_join_hash org.apache.hadoop.hive.cli.TestMiniLlapCliDriver.testCliDriver_tez_smb_main org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_acid_globallimit org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_auto_sortmerge_join_12 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_bucket_map_join_tez1 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_bucket_map_join_tez2 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_gby org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_join org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_limit org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_semijoin org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_stats org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_udf_udaf org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_union org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_views org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_cbo_windowing org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_explainuser_1 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_explainuser_2 org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_mapjoin_mapjoin org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_mergejoin org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_stats_only_null org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_tez_join_hash org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_tez_smb_empty org.apache.hadoop.hive.cli.TestMiniTezCliDriver.testCliDriver_tez_smb_main org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_auto_sortmerge_join_12 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_bucket_map_join_tez1 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_bucket_map_join_tez2 org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_gby org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_limit org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_semijoin org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_stats org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_udf_udaf org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_cbo_union org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_mapjoin_mapjoin org.apache.hadoop.hive.cli.TestSparkCliDriver.testCliDriver_stats_only_null org.apache.hadoop.hive.llap.daemon.impl.TestLlapTokenChecker.testCheckPermissions org.apache.hadoop.hive.llap.daemon.impl.TestLlapTokenChecker.testGetToken org.apache.hadoop.hive.llap.tezplugins.TestLlapTaskSchedulerService.testDelayedLocalityNodeCommErrorImmediateAllocation org.apache.hadoop.hive.metastore.TestMetaStoreMetrics.testConnections org.apache.hadoop.hive.ql.security.TestStorageBasedMetastoreAuthorizationDrops.testDropPartition {noformat} Test results: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/575/testReport Console output: https://builds.apache.org/job/PreCommit-HIVE-MASTER-Build/575/console Test logs: http://ec2-204-236-174-241.us-west-1.compute.amazonaws.com/logs/PreCommit-HIVE-MASTER-Build-575/ Messages: {noformat} Executing org.apache.hive.ptest.execution.TestCheckPhase Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 70 tests failed {noformat} This message is automatically generated. ATTACHMENT ID: 12818732 - PreCommit-HIVE-MASTER-Build > Hive generates inefficient metastore queries for TPCDS tables with 1800+ > partitions leading to higher compile time > ------------------------------------------------------------------------------------------------------------------ > > Key: HIVE-13995 > URL: https://issues.apache.org/jira/browse/HIVE-13995 > Project: Hive > Issue Type: Bug > Components: Hive > Affects Versions: 2.2.0 > Reporter: Nita Dembla > Assignee: Hari Sankar Sivarama Subramaniyan > Attachments: HIVE-13995.1.patch, HIVE-13995.2.patch, > HIVE-13995.3.patch, HIVE-13995.4.patch > > > TPCDS fact tables (store_sales, catalog_sales) have 1800+ partitions and when > the query does not a filter on the partition column, metastore queries > generated have a large IN clause listing all the partition names. Most RDBMS > systems have issues optimizing large IN clause and even when a good index > plan is chosen , comparing to 1800+ string values will not lead to best > execution time. > When all partitions are chosen, not specifying the partition list and having > filters only on table and column name will generate the same result set as > long as there are no concurrent modifications to partition list of the hive > table (adding/dropping partitions). > For eg: For TPCDS query18, the metastore query gathering partition column > statistics runs in 0.5 secs in Mysql. Following is output from mysql log > {noformat} > -- Query_time: 0.482063 Lock_time: 0.003037 Rows_sent: 1836 Rows_examined: > 18360 > select count("COLUMN_NAME") from "PART_COL_STATS" > where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = > 'catalog_sales' > and "COLUMN_NAME" in > ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit') > and "PARTITION_NAME" in > ('cs_sold_date_sk=2450815','cs_sold_date_sk=2450816','cs_sold_date_sk=2450817','cs_sold_date_sk=2450818','cs_sold_date_sk=2450819','cs_sold_date_sk=2450820','cs_sold_date_sk=2450821','cs_sold_date_sk=2450822','cs_sold_date_sk=2450823','cs_sold_date_sk=2450824','cs_sold_date_sk=2450825','cs_sold_date_sk=2450826','cs_sold_date_sk=2450827','cs_sold_date_sk=2450828','cs_sold_date_sk=2450829','cs_sold_date_sk=2450830','cs_sold_date_sk=2450831','cs_sold_date_sk=2450832','cs_sold_date_sk=2450833','cs_sold_date_sk=2450834','cs_sold_date_sk=2450835','cs_sold_date_sk=2450836','cs_sold_date_sk=2450837','cs_sold_date_sk=2450838','cs_sold_date_sk=2450839','cs_sold_date_sk=2450840','cs_sold_date_sk=2450841','cs_sold_date_sk=2450842','cs_sold_date_sk=2450843','cs_sold_date_sk=2450844','cs_sold_date_sk=2450845','cs_sold_date_sk=2450846','cs_sold_date_sk=2450847','cs_sold_date_sk=2450848','cs_sold_date_sk=2450849','cs_sold_date_sk=2450850','cs_sold_date_sk=2450851','cs_sold_date_sk=2450852','cs_sold_date_sk=2450853','cs_sold_date_sk=2450854','cs_sold_date_sk=2450855','cs_sold_date_sk=2450856',...,'cs_sold_date_sk=2452654') > group by "PARTITION_NAME"; > {noformat} > Functionally equivalent query runs in 0.1 seconds > {noformat} > --Query_time: 0.121296 Lock_time: 0.000156 Rows_sent: 1836 Rows_examined: > 18360 > select count("COLUMN_NAME") from "PART_COL_STATS" > where "DB_NAME" = 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = > 'catalog_sales' and "COLUMN_NAME" in > ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit') > group by "PARTITION_NAME"; > {noformat} > If removing the partition list seems drastic, its also possible to simply > list the range since hive gets a ordered list of partition names. This > performs equally well as earlier query > {noformat} > # Query_time: 0.143874 Lock_time: 0.000154 Rows_sent: 1836 Rows_examined: > 18360 > SET timestamp=1464014881; > select count("COLUMN_NAME") from "PART_COL_STATS" where "DB_NAME" = > 'tpcds_bin_partitioned_orc_30000' and "TABLE_NAME" = 'catalog_sales' and > "COLUMN_NAME" in > ('cs_bill_customer_sk','cs_bill_cdemo_sk','cs_item_sk','cs_quantity','cs_list_price','cs_sales_price','cs_coupon_amt','cs_net_profit') > and "PARTITION_NAME" >= 'cs_sold_date_sk=2450815' and "PARTITION_NAME" <= > 'cs_sold_date_sk=2452654' > group by "PARTITION_NAME"; > {noformat} > Another thing to check is the IN clause of column names. Columns in > projection list of hive query are mentioned here. Not sure if statistics of > these columns are required for hive query optimization. -- This message was sent by Atlassian JIRA (v6.3.4#6332)