When you use applyInPandasWithState, Spark passes input rows to your stateful function regardless of whether certain columns, such as the timestamp column, contain NULL values. This is useful when you want to handle incomplete or missing data within your stateful processing logic: because rows with NULL timestamps still trigger calls to the stateful function, you can implement your own handling strategy there, such as skipping incomplete records.
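To make the "skipping incomplete records" idea concrete, here is a minimal sketch of a stateful function that counts rows per key and ignores rows whose timestamp is NULL. The column names `key` and `event_time`, the function name `count_events`, and the `FakeGroupState` class are all assumptions made for illustration; `FakeGroupState` is only a hypothetical stand-in that mimics the small part of the GroupState interface used here, so that the logic can be exercised locally without a running stream. I have not run this against a real streaming query.

```python
from typing import Iterator, Tuple

import pandas as pd


def count_events(key: Tuple, pdf_iter: Iterator[pd.DataFrame], state) -> Iterator[pd.DataFrame]:
    """Count rows per key, skipping rows whose event_time is NULL."""
    # Restore the running count; state.get returns the stored tuple.
    (count,) = state.get if state.exists else (0,)
    skipped = 0
    for pdf in pdf_iter:
        has_ts = pdf["event_time"].notna()  # NULL timestamps show up as NaT/None
        count += int(has_ts.sum())          # only rows with a timestamp are counted
        skipped += int((~has_ts).sum())     # rows with NULL timestamps are skipped
    state.update((count,))
    yield pd.DataFrame({"key": [key[0]], "count": [count], "skipped": [skipped]})


class FakeGroupState:
    """Hypothetical stand-in for GroupState, for local testing only."""

    def __init__(self):
        self._value = None

    @property
    def exists(self):
        return self._value is not None

    @property
    def get(self):
        return self._value

    def update(self, value):
        self._value = value


# Local check with one batch that contains a NULL timestamp.
batch = pd.DataFrame({
    "key": ["a", "a", "a"],
    "event_time": pd.to_datetime(["2024-05-27 22:04:00", None, "2024-05-27 22:05:00"]),
})
state = FakeGroupState()
result = next(count_events(("a",), iter([batch]), state))
print(result)

# Assumed wiring in a streaming query (schemas and watermark delay are illustrative):
# df.withWatermark("event_time", "10 minutes").groupBy("key").applyInPandasWithState(
#     count_events,
#     outputStructType="key STRING, count LONG, skipped LONG",
#     stateStructType="count LONG",
#     outputMode="Update",
#     timeoutConf=GroupStateTimeout.NoTimeout,
# )
```

Keeping the NULL handling inside the function means the decision to skip, count, or route such rows stays explicit in your own code rather than depending on how Spark treats them upstream.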
However, it is important to understand that this behavior also *means that the watermark is not advanced for NULL timestamps*. The watermark is used for event-time processing in Spark Structured Streaming to track the progress of event time in your data stream, and it is typically based on the timestamp column, so rows with NULL timestamps do not contribute to its advancement.

As for whether you can rely on this behavior in your production code, that largely depends on your requirements and use case. If your application logic is designed to handle NULL timestamps appropriately and you have tested it to confirm that it behaves as expected, then you can generally rely on this behavior. FYI, I have not tested it myself, so I cannot provide a definitive answer.

Mich Talebzadeh,
Technologist | Architect | Data Engineer | Generative AI | FinCrime
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom

view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
https://en.everybodywiki.com/Mich_Talebzadeh

*Disclaimer:* The information provided is correct to the best of my knowledge but of course cannot be guaranteed. It is essential to note that, as with any advice, "one test result is worth one-thousand expert opinions" (Wernher von Braun <https://en.wikipedia.org/wiki/Wernher_von_Braun>).

On Mon, 27 May 2024 at 22:04, Juan Casse <jca...@gmail.com> wrote:

> I am using applyInPandasWithState in PySpark 3.5.0.
>
> I noticed that records with timestamp==NULL are processed (i.e., trigger a
> call to the stateful function). And, as you would expect, does not advance
> the watermark.
>
> I am taking advantage of this in my application.
>
> My question: Is this a supported feature of Spark? Can I rely on this
> behavior for my production code?
>
> Thanks,
> Juan
>