Hi,

Do you think there is any chance of this issue getting resolved? Should I create another bug report? As mentioned in my message, there is one open already: https://issues.apache.org/jira/browse/SPARK-45637 but it covers only one of the problems.
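For quicker triage, here is a condensed shape of the SPARK-45637 case as a PySpark sketch (PySpark is what I use in production). The rate source and identifiers are illustrative stand-ins; the real repro is OuterJoinTest.scala#L341 in the repository linked in the quoted message below.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, count

spark = SparkSession.builder.getOrCreate()

def windowed_counts(name):
    # rate source as an illustrative stand-in for a real event stream
    return (spark.readStream.format("rate").load()
            .withColumnRenamed("timestamp", "eventTime")
            .withWatermark("eventTime", "10 seconds")
            .groupBy(window("eventTime", "1 minute"))
            .agg(count("*").alias(name)))

left = windowed_counts("leftCnt")
right = windowed_counts("rightCnt")

# Full outer join of two streams that were both aggregated in Spark.
# I'd expect one row per window; in my tests no result is ever produced.
joined = left.join(right, "window", "full_outer")

query = (joined.writeStream.format("console")
         .outputMode("append").start())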
Andrzej

On Tue, 27 Feb 2024 at 09:58, Andrzej Zera <andrzejz...@gmail.com> wrote:

> Hi,
>
> Yes, I tested all of them on Spark 3.5.
>
> Regards,
> Andrzej
>
> On Mon, 26 Feb 2024 at 23:24, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> These are all on Spark 3.5, correct?
>>
>> Mich Talebzadeh
>>
>> On Mon, 26 Feb 2024 at 22:18, Andrzej Zera <andrzejz...@gmail.com> wrote:
>>
>>> Hey all,
>>>
>>> I've been using Structured Streaming in production for almost a year now and I want to share the bugs I've found in that time. I created a test for each of the issues and put them all here:
>>> https://github.com/andrzejzera/spark-bugs/tree/main/spark-3.5/src/test/scala
>>>
>>> I split the issues into three groups: outer joins on event time, interval joins and Spark SQL.
>>>
>>> Issues related to outer joins:
>>>
>>> - When joining three or more input streams on event time, if two or more streams don't contain an event for a given join key (which is event time), no row is output even if the other streams do contain an event for that key. Tests that check for this:
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L86
>>> and
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L169
>>> - When joining a stream aggregated in Spark from raw events with a stream of already-aggregated events (aggregation done outside Spark), no row is output if the second stream doesn't contain a corresponding event. Test that checks for this:
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L266
>>> - When joining two aggregated streams (both aggregated in Spark), no result is produced. Test that checks for this:
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/OuterJoinTest.scala#L341.
>>> I've already reported this one here: https://issues.apache.org/jira/browse/SPARK-45637 but it hasn't been handled yet.
>>>
>>> Issues related to interval joins:
>>>
>>> - When joining three streams (A, B, C) using an interval join on event time, such that B.eventTime is conditioned on A.eventTime and C.eventTime is also conditioned on A.eventTime, and then doing a window aggregation based on A's event time, the result is output only after the watermark crosses window end + interval(A, B) + interval(A, C). However, I'd expect results to be output sooner, i.e. when the watermark crosses window end + MAX(interval(A, B), interval(A, C)). If, for example, event B can happen up to 3 minutes after event A and event C up to 5 minutes after A, there is no point in suspending output for 8 minutes (3+5) after the end of the window when we know that no more events can be matched beyond 5 minutes from the window end (assuming window end is based on A's event time). Test that checks for this (a condensed sketch follows below):
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/IntervalJoinTest.scala#L32
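>>>
>>> To make the shape of that query concrete, here is a condensed PySpark sketch (the rate source and identifiers are illustrative stand-ins; the full test is the IntervalJoinTest.scala link above):
>>>
>>> from pyspark.sql import SparkSession
>>> from pyspark.sql.functions import expr, window, count
>>>
>>> spark = SparkSession.builder.getOrCreate()
>>>
>>> def src(id_col, ts_col):
>>>     # rate source as an illustrative stand-in for a real event stream
>>>     return (spark.readStream.format("rate").load()
>>>             .selectExpr(f"value % 10 AS {id_col}", f"timestamp AS {ts_col}"))
>>>
>>> a = src("aId", "aTime").withWatermark("aTime", "0 seconds")
>>> b = src("bId", "bTime").withWatermark("bTime", "0 seconds")
>>> c = src("cId", "cTime").withWatermark("cTime", "0 seconds")
>>>
>>> # B conditioned on A (up to 3 min after), C conditioned on A (up to 5 min after)
>>> ab = a.join(b, expr("aId = bId AND bTime BETWEEN aTime AND aTime + INTERVAL 3 MINUTES"))
>>> abc = ab.join(c, expr("aId = cId AND cTime BETWEEN aTime AND aTime + INTERVAL 5 MINUTES"))
>>>
>>> # window aggregation on A's event time
>>> agg = abc.groupBy(window("aTime", "1 minute")).agg(count("*").alias("cnt"))
>>> # Observed: a window is emitted only once the watermark passes window end + 8 min (3 + 5).
>>> # Expected: emission once the watermark passes window end + 5 min (the MAX).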
>>>
>>> SQL issues:
>>>
>>> - The WITH clause (in contrast to a subquery) seems to create a static DataFrame that can't be used in streaming joins (a condensed sketch is in the PS below). Test that checks for this:
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/SqlSyntaxTest.scala#L31
>>> - Two subqueries, each aggregating data using the window() function, break the output schema. Test that checks for this:
>>> https://github.com/andrzejzera/spark-bugs/blob/abae7a3839326a8eafc7516a51aca5e0c79282a6/spark-3.5/src/test/scala/SqlSyntaxTest.scala#L122
>>>
>>> I'm a beginner with Scala (I use Structured Streaming with PySpark), so I won't be able to provide fixes. But I hope the test cases I provided can be of some help.
>>>
>>> Regards,
>>> Andrzej
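PS. A condensed PySpark form of the WITH-clause case, in case it helps triage (a sketch only; the rate source and identifiers are illustrative, the full version is the SqlSyntaxTest.scala link in the quote above):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# streaming temp view as the input for both variants of the query
(spark.readStream.format("rate").load()
    .selectExpr("value % 10 AS id", "timestamp AS eventTime")
    .withWatermark("eventTime", "10 seconds")
    .createOrReplaceTempView("events"))

# Subquery version: the joined side stays a streaming relation, as expected.
sub = spark.sql("""
    SELECT a.id, a.eventTime
    FROM events a
    JOIN (SELECT * FROM events) b
      ON a.id = b.id
     AND b.eventTime BETWEEN a.eventTime AND a.eventTime + INTERVAL 1 MINUTE
""")

# WITH-clause version of the same query: the CTE side appears to be
# treated as a static DataFrame, so the streaming join doesn't work.
cte = spark.sql("""
    WITH b AS (SELECT * FROM events)
    SELECT a.id, a.eventTime
    FROM events a
    JOIN b
      ON a.id = b.id
     AND b.eventTime BETWEEN a.eventTime AND a.eventTime + INTERVAL 1 MINUTE
""")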