StreamingQueryListener in PySpark lagging behind

Andrzej Zera Thu, 31 Oct 2024 09:36:16 -0700

Hello,

I have an issue with StreamingQueryListener in my Structured Streaming
application written in PySpark. I'm running around 8 queries, and each
query runs every 5-20 seconds. In total, I have around 40 microbatch
executions per minute. I set up a Python StreamingQueryListener to collect
metrics from the queries. The problem is that the listener falls behind
over time, and the delay keeps growing without bound. By delayed, I mean
that an event does not reach the Python listener class immediately after
the microbatch finishes, but only after, for example, 2 hours.
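
For illustration, the delay described above could be measured roughly as
follows. This is a minimal sketch, not my production code: the helper name
listener_lag_seconds is made up for this example, and it assumes the
progress timestamp is the ISO 8601 UTC string that Spark reports for the
start of the trigger, so the result also includes the batch duration itself.

```python
from datetime import datetime, timezone


def listener_lag_seconds(progress_timestamp, now=None):
    """Seconds between a batch's trigger timestamp and the moment the callback sees it.

    `progress_timestamp` is assumed to look like "2024-10-31T16:36:16.000Z".
    """
    started = datetime.strptime(
        progress_timestamp, "%Y-%m-%dT%H:%M:%S.%fZ"
    ).replace(tzinfo=timezone.utc)
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - started).total_seconds()


# Hypothetical wiring inside a PySpark listener (requires pyspark):
#
#     def onQueryProgress(self, event):
#         lag = listener_lag_seconds(event.progress.timestamp)
#         print(event.progress.name, event.progress.batchId, lag)
```

A lag that keeps climbing from one event to the next, rather than hovering
around the batch duration, is the behaviour I am describing.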


My understanding is that by using a Python listener, I pay the overhead of
communication between the JVM and Python over the py4j bridge (and internally
in Spark, it's a single-threaded process). Indeed, when I implemented a basic
listener in Scala, the problem went away and the listener was very fast.

Unfortunately, it would be difficult for me to port my Python logic (from
onQueryProgress) to a Scala listener, so I would much prefer to stay with
the Python one. The logic itself is not the issue: it is really fast, and
I have measured that.

Is there anything I can do to make the Python listener faster? I don't need
to collect metrics from all queries, but from what I understand, it's not
possible to collect metrics from selected queries - only from all of them
or none at all.
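
To make the "selected queries" idea concrete, here is a minimal sketch of
filtering by query name inside the callback and handing the rest of the work
to a background thread. The class SelectiveMetricsSink and its method names
are hypothetical, invented for this example. Note that this only keeps the
callback short; the listener still receives every event from the JVM, so it
does not by itself remove the bridge overhead.

```python
import queue
import threading
import traceback


class SelectiveMetricsSink:
    """Keeps events for selected query names and processes them off the callback thread."""

    def __init__(self, wanted_names, handler):
        self._wanted = frozenset(wanted_names)
        self._handler = handler
        self._queue = queue.Queue()
        # Daemon thread so it never blocks interpreter shutdown.
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def submit(self, query_name, payload):
        """Call from onQueryProgress; returns True if the event was kept."""
        if query_name not in self._wanted:
            return False
        self._queue.put(payload)
        return True

    def _drain(self):
        while True:
            payload = self._queue.get()
            try:
                self._handler(payload)
            except Exception:
                # Keep the worker alive if one event fails to process.
                traceback.print_exc()
            finally:
                self._queue.task_done()

    def flush(self):
        """Block until every queued event has been handled."""
        self._queue.join()


# Hypothetical wiring inside a PySpark listener (requires pyspark; the other
# callbacks of StreamingQueryListener still have to be defined):
#
#     def onQueryProgress(self, event):
#         sink.submit(event.progress.name, event.progress.json)
```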

Andrzej
