I see in the documentation that the distinct operation is not supported
in Structured Streaming:
<https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#unsupported-operations>
That being said, I have noticed that I am able to successfully call
distinct() on a DataFrame. It seems to perform the desired operation and
does not fail with the AnalysisException I expected. If I call it with a
column name specified, it does fail with an AnalysisException.
I am using Structured Streaming to read from a Kafka stream, and my question
(and concern) is whether:
- The distinct operation is properly applied across the *current* batch
  as read from Kafka, but is not applied across batches.
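To make the two possible behaviours concrete, here is a toy model in plain
Python. It is not Spark code and says nothing about how Spark implements
distinct; the batch contents and the function names are made up for
illustration.

```python
# Toy model only: plain Python, not Spark. Batch contents are hypothetical.
batches = [
    ["a", "b", "a"],  # first micro-batch
    ["b", "c"],       # second micro-batch; "b" was already seen in the first
]


def distinct_per_batch(batches):
    """Deduplicate inside each batch only; nothing is remembered between batches."""
    return [sorted(set(batch)) for batch in batches]


def distinct_across_batches(batches):
    """Deduplicate against everything seen so far; needs state kept between batches."""
    seen = set()
    emitted = []
    for batch in batches:
        new_values = sorted(set(batch) - seen)  # values not seen in any earlier batch
        seen.update(new_values)
        emitted.append(new_values)
    return emitted


per_batch = distinct_per_batch(batches)
across = distinct_across_batches(batches)
print(per_batch)  # "b" is emitted again in the second batch
print(across)     # "b" is suppressed in the second batch
```

If distinct only applied per batch, a repeated value such as "b" would show
up again in a later batch; if it applies across batches, it would not.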
I have tried the following:
- Started the streaming job to see my baseline data and left the job
  streaming.
- Created events in Kafka that would increment my counts if distinct were
  not performing as expected.
- Results:
  - Distinct still seems to be working over the entire data set even as
    I add new data.
  - As I add new data, I see Spark process the data (I’m doing output
    mode = update), but there are no new results, indicating that the
    distinct function is in fact still working across batches as Spark
    pulls in the new data from Kafka.
Does anyone know more about the intended behavior of distinct in Structured
Streaming?
If this is working as intended, does that mean Spark could be holding a
dataset that grows without bound in memory or on disk, so that it has some
way to apply the distinct operation against previously seen data?
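
The growth concern can be sketched with another toy model in plain Python.
This is only an assumption about what cross-batch deduplication would
require in principle (remembering every distinct value seen), not a
description of Spark's actual state store; the key names and batch layout
are hypothetical.

```python
# Toy model only: plain Python, not Spark internals.
def state_sizes(batches):
    """Return the number of remembered distinct values after each batch."""
    seen = set()
    sizes = []
    for batch in batches:
        seen.update(batch)      # every new distinct value must be remembered
        sizes.append(len(seen))
    return sizes


# Hypothetical stream: each batch has 3 keys, overlapping the previous batch by 1.
stream = [[f"key-{i}" for i in range(start, start + 3)] for start in range(0, 10, 2)]

sizes = state_sizes(stream)
print(sizes)  # the remembered set never shrinks as long as new keys keep arriving
```

In this model nothing is ever evicted, so the remembered set grows for as
long as the stream keeps producing new values.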