Hello, I have an issue with StreamingQueryListener in my Structured Streaming application written in PySpark. I'm running around 8 queries, and each query runs every 5-20 seconds. In total, that is about 40 microbatch executions per minute. I set up a Python StreamingQueryListener to collect metrics from the queries. The problem is that the listener falls behind over time, and the delay keeps growing without bound. By delayed, I mean that an event is not delivered to the Python listener class right after the microbatch finishes, but, for example, 2 hours later.
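For illustration, here is a minimal sketch of the kind of listener I mean. It is not my actual code: the class name `SelectiveListener`, the query name `"orders"`, and the choice of fields are placeholders, and the `ImportError` fallback is only there so the snippet can be run without Spark. It drops events from queries I don't care about and only queues the rest, so the callback body itself stays very short.

```python
import queue
from types import SimpleNamespace

try:
    from pyspark.sql.streaming import StreamingQueryListener
except ImportError:
    # Fallback so the sketch can be run without Spark installed.
    StreamingQueryListener = object


class SelectiveListener(StreamingQueryListener):
    """Hypothetical listener: keeps progress events only for chosen query names."""

    def __init__(self, wanted_names):
        super().__init__()
        self.wanted_names = set(wanted_names)
        # Events are queued here; heavier work can happen elsewhere.
        self.events = queue.Queue()

    def onQueryStarted(self, event):
        pass

    def onQueryProgress(self, event):
        progress = event.progress
        # Ignore queries that were not selected by name.
        if progress.name not in self.wanted_names:
            return
        self.events.put((progress.name, progress.batchId, progress.numInputRows))

    def onQueryIdle(self, event):
        pass

    def onQueryTerminated(self, event):
        pass


# Dry run with stand-in events; in the application the events come from Spark.
def fake_event(name, batch_id, rows):
    return SimpleNamespace(
        progress=SimpleNamespace(name=name, batchId=batch_id, numInputRows=rows)
    )


listener = SelectiveListener({"orders"})
listener.onQueryProgress(fake_event("orders", 7, 100))
listener.onQueryProgress(fake_event("other", 1, 5))
```

In the application the listener would be registered with `spark.streams.addListener(SelectiveListener({"orders"}))`. Filtering like this happens only after the event has already reached Python, so it does not avoid the per-event JVM-to-Python cost.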
My understanding is that with a Python listener I pay the overhead of communication between the JVM and Python over the py4j bridge (and internally in Spark, it's a single-threaded process). Indeed, when I implemented a basic listener in Scala, the problem went away and the listener was very fast. Unfortunately, it would be difficult for me to move my Python logic (from onQueryProgress) into a Scala listener, so I would much prefer to stay with the Python one. The logic itself is not the issue: it's really fast, I measured that. Is there anything I can do to make the Python listener faster? I don't need to collect metrics from all queries, but from what I understand, it's not possible to collect metrics from selected queries only; it's either all of them or none at all.
Andrzej