Alex C created SPARK-51399:
------------------------------

             Summary: Empty result for a stream-stream inner join to compute 
rolling averages
                 Key: SPARK-51399
                 URL: https://issues.apache.org/jira/browse/SPARK-51399
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.5.4, 3.5.3
         Environment: As per dev container: 
[https://github.com/alza-bitz/playpen-python/blob/spark-streaming-empty-join/.devcontainer/devcontainer.json]

 
            Reporter: Alex C


I've been trying to use Spark Structured Streaming for a hypothetical data 
streaming question against a non-commercial IMDB dataset.

The original raw data: 
[https://datasets.imdbws.com|https://datasets.imdbws.com/]
 
The original schema: 
[https://zindilis.com/posts/imdb-non-commercial-datasets-schema|https://zindilis.com/posts/imdb-non-commercial-datasets-schema/]

The question:

{_}"Determine the ranking for titles, with the ranking determined by: 
(numVotes/averageNumVotes) * averageRating"{_}

The ratings are assumed to be continuously streamed in, so I amended the 
schema to make it an event log with an additional timestamp field.
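
For reference, the amended ratings schema looks roughly like this 
({{event_time}} is just the name I chose for the added timestamp column; the 
other columns are from the IMDb {{title.ratings}} dataset):

{code:python}
from pyspark.sql.types import (DoubleType, IntegerType, StringType,
                               StructField, StructType, TimestampType)

# title.ratings columns from the IMDb dataset, plus an added event-time
# column that turns the ratings into an event log.
ratings_schema = StructType([
    StructField("tconst", StringType()),
    StructField("averageRating", DoubleType()),
    StructField("numVotes", IntegerType()),
    StructField("event_time", TimestampType()),  # the added timestamp field
])
{code}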

The documentation at 
[https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html]
 gave me the impression that the stateful part of this problem could be managed 
by the framework without any custom state handling, at least in recent 
versions. Essentially, it's a streaming aggregation followed by a streaming 
inner join.
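
To give an idea of the shape of the code, here is a simplified sketch (the 
notebooks have the actual code; the input path, window size and watermark 
delay below are placeholders):

{code:python}
from pyspark.sql import SparkSession
from pyspark.sql.functions import avg, col, window

spark = SparkSession.builder.appName("imdb-streaming-join").getOrCreate()

# Ratings event log, read as a stream (placeholder path).
ratings = (spark.readStream
           .schema("tconst STRING, averageRating DOUBLE, "
                   "numVotes INT, event_time TIMESTAMP")
           .option("sep", "\t")
           .csv("/data/ratings_events/"))

# Streaming aggregation: overall average number of votes per event-time window.
avg_votes = (ratings
             .withWatermark("event_time", "10 minutes")
             .groupBy(window(col("event_time"), "1 hour"))
             .agg(avg("numVotes").alias("averageNumVotes")))

# Per-title figures over the same windows, so both sides share the window
# column as a join key.
per_title = (ratings
             .withWatermark("event_time", "10 minutes")
             .groupBy(window(col("event_time"), "1 hour"), col("tconst"))
             .agg(avg("averageRating").alias("averageRating"),
                  avg("numVotes").alias("numVotes")))

# Streaming inner join on the window column, then the ranking metric.
ranked = (per_title.join(avg_votes, on="window", how="inner")
          .withColumn("ranking",
                      (col("numVotes") / col("averageNumVotes"))
                      * col("averageRating")))

query = (ranked.writeStream
         .outputMode("append")
         .format("memory")
         .queryName("ranked_titles")
         .start())
{code}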

However, I ran into a problem: the join produces an empty result, even though 
both join inputs are populated with rows whose join keys match when I compute 
them separately (same code, apart from the watermark handling and the sinks). 
The same code also produces the expected join output when run in a 
non-streaming manner.
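
For illustration, the inputs check looks roughly like this, continuing from 
the sketch above (memory sinks are used here only so the inputs can be 
inspected; the notebooks differ in the details):

{code:python}
# Continuing the sketch above: run each join input as its own query instead
# of joining them, so the contents can be inspected.
for name, side in [("per_title", per_title), ("avg_votes", avg_votes)]:
    (side.writeStream
     .outputMode("append")
     .format("memory")
     .queryName(name)
     .start())

# After letting some batches complete, both sides contain rows whose window
# keys line up...
spark.sql("SELECT * FROM per_title").show(truncate=False)
spark.sql("SELECT * FROM avg_votes").show(truncate=False)
# ...yet the joined query from the previous sketch stays empty.
{code}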

Here are some notebooks illustrating each of the above cases:
 # Streaming inner join with unexpected empty result: 
[https://github.com/alza-bitz/playpen-python/blob/spark-streaming-empty-join/imdb/spark_streaming_join_result.ipynb]

 # Streaming without join but with expected join inputs: 
[https://github.com/alza-bitz/playpen-python/blob/spark-streaming-empty-join/imdb/spark_streaming_join_inputs.ipynb]

 # Non-streaming equivalent for reference: 
[https://github.com/alza-bitz/playpen-python/blob/spark-streaming-empty-join/imdb/spark_sql.ipynb]

These should also be runnable directly in a dev container; the project includes 
a suitable {{devcontainer.json}} that installs Spark 3.5.3.

At this stage, I'm unclear on whether this is a bug or whether I have 
misinterpreted the docs.

I did find a reference to what might be the same issue on Reddit, with a 
suggestion to update to version 3.5.4: 
[https://www.reddit.com/r/apachespark/comments/1cb056c/structured_streaming_join_on_window_column|https://www.reddit.com/r/apachespark/comments/1cb056c/structured_streaming_join_on_window_column/]

I tried that, but unfortunately the problem remained.

Thanks!


