Alex C created SPARK-51399:
------------------------------

             Summary: Empty result for a stream-stream inner join to compute rolling averages
                 Key: SPARK-51399
                 URL: https://issues.apache.org/jira/browse/SPARK-51399
             Project: Spark
          Issue Type: Bug
          Components: Structured Streaming
    Affects Versions: 3.5.4, 3.5.3
         Environment: As per dev container: [https://github.com/alza-bitz/playpen-python/blob/spark-streaming-empty-join/.devcontainer/devcontainer.json]
            Reporter: Alex C


I've been trying to use Spark Structured Streaming for a hypothetical data streaming question against a non-commercial IMDB dataset.

The original raw data: [https://datasets.imdbws.com/]
The original schema: [https://zindilis.com/posts/imdb-non-commercial-datasets-schema/]

The question: {_}"Determine the ranking for titles, with the ranking determined by: (numVotes/averageNumVotes) * averageRating"{_}

The ratings are assumed to be continuously streamed in, so I amended the schema to turn it into an event log with an additional timestamp field.

The documentation here [https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html] gave me the impression that the stateful part of my problem could be managed by the framework without any custom state handling, at least with the latest version. Essentially it's a streaming aggregation followed by a streaming inner join.

However, I ran into a problem: I get an empty result from the join, even though both join inputs are populated with rows whose join keys match when I compute them separately (same code, apart from watermark handling and sinks). The same code also produces the expected join output when run in a non-streaming manner.
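For reference, the ranking formula from the question can be sanity-checked in plain Python over a handful of made-up rating rows (the column names follow the IMDB schema; the title IDs and numbers are invented for illustration, and the single global average stands in for the rolling average the streaming aggregation would compute):

```python
# Made-up rating rows: (title, averageRating, numVotes)
ratings = [
    ("tt0000001", 8.0, 200),
    ("tt0000002", 6.0, 100),
    ("tt0000003", 7.0, 300),
]

# averageNumVotes across all titles; in the streaming version this would
# be a rolling aggregate over a time window rather than a global average.
average_num_votes = sum(num_votes for _, _, num_votes in ratings) / len(ratings)

# ranking = (numVotes / averageNumVotes) * averageRating
rankings = {
    title: (num_votes / average_num_votes) * avg_rating
    for title, avg_rating, num_votes in ratings
}

# Titles ordered by ranking, highest first
print(sorted(rankings.items(), key=lambda kv: kv[1], reverse=True))
```

With these numbers, average_num_votes is 200, so the titles rank tt0000003 (10.5), tt0000001 (8.0), tt0000002 (3.0).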
Here are some notebooks illustrating each of the above cases:
# Streaming inner join with an unexpected empty result: [https://github.com/alza-bitz/playpen-python/blob/spark-streaming-empty-join/imdb/spark_streaming_join_result.ipynb]
# Streaming without the join but with the expected join inputs: [https://github.com/alza-bitz/playpen-python/blob/spark-streaming-empty-join/imdb/spark_streaming_join_inputs.ipynb]
# Non-streaming equivalent for reference: [https://github.com/alza-bitz/playpen-python/blob/spark-streaming-empty-join/imdb/spark_sql.ipynb]

These should also be runnable directly in a dev container; the project includes a suitable {{devcontainer.json}} that installs Spark 3.5.3.

At this stage, I'm unclear whether this is a bug or whether I have misinterpreted the docs. I did find a reference to what might be the same issue on Reddit, with a suggestion to update to version 3.5.4: [https://www.reddit.com/r/apachespark/comments/1cb056c/structured_streaming_join_on_window_column/]

I tried that, but unfortunately the problem remained.

Thanks!



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org