I'll sketch out a PR so we can talk code and move the discussion there.
On 18.03.21 at 14:55, Wenchen Fan wrote:
I think a listener-based API makes sense for streaming (since you need to
keep watching the result), but may not be very reasonable for batch queries
(you only get the result once). The idea of Observation looks good, but we
should define what happens if `observation.get` is called before the batch
query finishes.
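To make that open question concrete, here is a minimal, Spark-free sketch of how an Observation-style handle could behave, with `get` blocking until the query's listener has published the metrics. Class and method names here are illustrative assumptions, not Spark's actual implementation.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Illustrative stand-in for an Observation handle. The query listener
// calls publish() once; callers of get() block until that happens.
public class ObservationSketch {
    private final CountDownLatch done = new CountDownLatch(1);
    private final Map<String, Object> metrics = new ConcurrentHashMap<>();

    // Called by the (hypothetical) query listener when the batch completes.
    public void publish(Map<String, Object> observed) {
        metrics.putAll(observed);
        done.countDown();
    }

    // One possible answer to "get() before the batch finishes": block
    // the caller until the metrics are available.
    public Map<String, Object> get() throws InterruptedException {
        done.await();
        return metrics;
    }

    public static void main(String[] args) throws Exception {
        ObservationSketch obs = new ObservationSketch();
        // Simulate the query finishing on another thread.
        new Thread(() -> obs.publish(Map.of("rowCount", 42L))).start();
        System.out.println(obs.get().get("rowCount"));
    }
}
```

Other answers are possible too (throw, return an `Optional`, time out); blocking is just the simplest semantics to specify.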
Please follow up on the discussion in the original PR:
https://github.com/apache/spark/pull/26127
Dataset.observe() relies on the query listener for batch queries, which is
an "unstable" API - that's why we decided not to add an example for the
batch query. For streaming queries, it relies on the streaming query listener.
I am focusing on batch mode, not streaming mode. I would argue that
Dataset.observe() is equally useful for large batch processing. If you
need some motivating use cases, please let me know.
Anyhow, the documentation of observe states that it works for both batch and
streaming. And in batch mode,
If I remember correctly, the major audience of the "observe" API is
Structured Streaming in micro-batch mode. From the example, the abstraction
in 2 isn't something that works with Structured Streaming. It could still be
done with a callback, but the question remains how much complexity is
hidden from