Hi folks, 1) First Problem: I'm querying MySQL. I submit a query like this:
out = wam.select('message_id', 'business_id', 'info', 'entered_system_date', 'auto_update_time').filter("auto_update_time >= '2020-04-01 05:27'").dropDuplicates(['message_id', 'auto_update_time']) But what I see in the DB is this: SELECT `message_id`,`business_id`,`info`,`entered_system_date`,`auto_update_time` FROM message WHERE (`auto_update_time` IS NOT NULL) Of course the IS NOT NULL is causing a scan of the whole table - SLOOOOW If I do the query straight by loading from MySQL with a query instead of dbtable: myQuery = 'select business_id, entered_system_date, message_id, info, auto_update_time from message where auto_update_time >= "2020-04-01 05:27"' jdbcUrl = jdbc:mysql://172.xx.xx.xx/prod wam = spark\ .read\ .format("jdbc")\ .option("url", jdbcUrl)\ .option("query", myQuery)\ <------------------------ difference here .option("user", connection['scriptParams']['user'])\ .option("password", connection['scriptParams']['pass'])\ .option("numPartitions",10)\ .load() then it's fast as expected. Question: How come is the query adding on the IS NOT NULL - and this is to every single query I try to perform if I use dbtable instead of query in the load/read? 2) Second Problem: Additionally, if I stop this query (I'm using Zeppelin), then the query on the DB side continues running for a long while (read 40 - 60s) after the query stops. Question: How come is this? I mean, it should stop within a second or 3 - not 60! Any advice would be welcome Version of MySQL driver: 8.0.19. Db MySQL 5.7.23 Spark version 2.4.4 Could it be a different version of the JDBC driver? I've tried 5.1.48 too. Same effect. Thanks in advance, Hamish -- Cloud-Fundis.co.za Cape Town, South Africa +27 79 614 4913