Hi folks,

1) First Problem:
I'm querying MySQL. I submit a query like this:

out = wam.select('message_id', 'business_id', 'info',
'entered_system_date', 'auto_update_time').filter("auto_update_time >=
'2020-04-01 05:27'").dropDuplicates(['message_id', 'auto_update_time'])

But what I see in the DB is this:

SELECT
`message_id`,`business_id`,`info`,`entered_system_date`,`auto_update_time`
FROM message WHERE (`auto_update_time` IS NOT NULL)

Of course the IS NOT NULL is causing a scan of the whole table - SLOOOOW

If I do the query straight by loading from MySQL with a query instead of
dbtable:

myQuery = 'select business_id, entered_system_date, message_id, info,
auto_update_time from message where auto_update_time >= "2020-04-01 05:27"'

jdbcUrl = jdbc:mysql://172.xx.xx.xx/prod

wam = spark\
    .read\
    .format("jdbc")\
    .option("url", jdbcUrl)\
    .option("query", myQuery)\        <------------------------ difference
here
    .option("user", connection['scriptParams']['user'])\
    .option("password", connection['scriptParams']['pass'])\
    .option("numPartitions",10)\
    .load()

then it's fast as expected.

Question: How come is the query adding on the IS NOT NULL - and this is to
every single query I try to perform if I use dbtable instead of query in
the load/read?

2) Second Problem:
Additionally, if I stop this query (I'm using Zeppelin), then the query on
the DB side continues running for a long while (read 40 - 60s) after the
query stops.

Question: How come is this? I mean, it should stop within a second or 3 -
not 60!

Any advice would be welcome

Version of MySQL driver: 8.0.19.
Db MySQL 5.7.23
Spark version 2.4.4

Could it be a different version of the JDBC driver? I've tried 5.1.48 too.
Same effect.

Thanks in advance,
Hamish
-- 
Cloud-Fundis.co.za
Cape Town, South Africa
+27 79 614 4913

Reply via email to