Witold Drabicki created HIVE-25949: -------------------------------------- Summary: predicate pushdown does not seem to work correctly for external tables pointing to MySQL Key: HIVE-25949 URL: https://issues.apache.org/jira/browse/HIVE-25949 Project: Hive Issue Type: Bug Components: Hive Affects Versions: 2.3.7 Environment: Hive runs on *GCP Dataproc,* image version is *1.5.56-debian10* (Hive v {*}2.3.7{*})
*MySQL* server version is {*}5.7.36{*}. The following jars are used: {code:java} add jar gs://d-test-bucket-1/commons-pool-1.6.jar; add jar gs://d-test-bucket-1/hive-jdbc-handler-2.3.7.jar; add jar gs://d-test-bucket-1/commons-dbcp-1.4.jar; add jar gs://d-test-bucket-1/mysql-connector-java-8.0.27.jar; (identical behavior when using mysql-connector-java-5.1.49){code} Reporter: Witold Drabicki The following external table has been created: {code:java} CREATE EXTERNAL TABLE test_table_ext ( col1 integer, col2 integer, col3 bigint, col4 decimal(38,15), ... col13 decimal(38,15), partition_column integer ) STORED BY 'org.apache.hive.storage.jdbc.JdbcStorageHandler' TBLPROPERTIES ( "hive.sql.database.type" = "MYSQL", "hive.sql.jdbc.driver" = "com.mysql.cj.jdbc.Driver", "hive.sql.jdbc.url" = "jdbc:mysql://<ip>/<schema-name>", "hive.sql.dbcp.username" = "<name>", "hive.sql.dbcp.password" = "<password>", "hive.sql.table" = "TEST_TABLE", "hive.sql.schema" = "<schema-name>", "hive.sql.query" = "select col1, col2, col3, col4, ..., col13, partition_column from <schema-name>.TEST_TABLE", "hive.sql.dbcp.maxActive" = "1", "hive.sql.numPartitions" = "1" ); {code} Corresponding MySQL table: {code:java} CREATE TABLE TEST_TABLE ( col1 int(11), col2 int(11), col3 bigint(20), col4 decimal(38,15), ... col13 decimal(38,15), partition_column int(11) ) ENGINE=InnoDB DEFAULT CHARSET=latin1 PARTITION BY LIST (partition_column) {code} The MySQL table has *470678* rows and *363* partitions. Running *ANALYZE TABLE COMPUTE STATISTICS* calculates the number of rows in the table as {*}11303242{*}, which is incorrect. The following 2 queries have been tested: 1) *select count(*) from test_table_ext where col2 = 7602* 2) *select count(*) from test_table_ext where partition_column = 20220207 col2 = 7602* *hive.optimize.ppd* setting is set to *true* Query plan for query #2 is the following: {code:java} Plan optimized by CBO. Vertex dependency in root stage Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE) Stage-0 Fetch Operator limit:-1 Stage-1 Reducer 2 File Output Operator [FS_7] Group By Operator [GBY_5] (rows=1 width=8) Output:["_col0"],aggregations:["count(VALUE._col0)"] <-Map 1 [CUSTOM_SIMPLE_EDGE] PARTITION_ONLY_SHUFFLE [RS_4] Group By Operator [GBY_3] (rows=1 width=8) Output:["_col0"],aggregations:["count()"] Select Operator [SEL_2] (rows=1 width=0) Filter Operator [FIL_8] (rows=1 width=0) predicate:((UDFToDouble(partition_column) = 2.0220207E7) and (UDFToDouble(col2) = 7602.0)) TableScan [TS_0] (rows=11303242 width=0) wd@test_table_ext,test_table_ext,Tbl:PARTIAL,Col:NONE,Output:["col2","partition_column"] {code} Since query #2 is specifying partition value, expected behavior is that it should be noticeably faster that query #1 which has to scan all partitions, but these 2 queries take the same amount of time (150 seconds). This has been tested multiple times. When the same queries are executed directly in MySQL, query #2 runs noticeably faster than query #1 (0.01 vs 0.26 seconds). -- This message was sent by Atlassian Jira (v8.20.1#820001)