can we do self join on streaming dataset in 2.2.0?

kant kodali Sat, 17 Feb 2018 17:18:07 -0800

Hi All,

I know that stream to stream joins are not yet supported. From the text
below I wonder if we can do self joins on the same streaming
dataset/dataframe in 2.2.0 since there are no two explicit streaming
datasets or dataframes?


Thanks!!



In Spark 2.3, we have added support for stream-stream joins, that is, you
can join two streaming Datasets/DataFrames. The challenge of generating
join results between two data streams is that, at any point of time, the
view of the dataset is incomplete for both sides of the join making it much
harder to find matches between inputs. Any row received from one input
stream can match with any future, yet-to-be-received row from the other
input stream. Hence, for both the input streams, we buffer past input as
streaming state, so that we can match every future input with past input
and accordingly generate joined results. Furthermore, similar to streaming
aggregations, we automatically handle late, out-of-order data and can limit
the state using watermarks. Let’s discuss the different types of supported
stream-stream joins and how to use them.

can we do self join on streaming dataset in 2.2.0?

Reply via email to