Hello,

So, I assume there is nothing in Structured Streaming to apply/transform based on a function that takes a dataframe as input and outputs a dataframe?
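For concreteness, here is a minimal sketch of the kind of dataframe-in, dataframe-out function I have in mind (the names and the logic are made up; the self-join stands in for the operations that streaming dataframes do not support yet):

```python
from pyspark.sql import functions as F

def my_transform(df):
    """Plain batch logic: takes a dataframe, returns a dataframe."""
    # Join each row against a per-key aggregate of the same data, which is
    # exactly the kind of operation not supported on streaming dataframes.
    totals = df.groupBy("key").agg(F.sum("value").alias("total"))
    return (df.join(totals, on="key")
              .withColumn("share", F.col("value") / F.col("total")))
```

What I am after is a way to run something like my_transform over the full dataframe accumulated for each window, once the watermark closes it.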
UDAFs are kind of low level: AFAIK, they require you to implement merge and to process individual rows (and they are not available in pyspark). Instead, I would like to directly transform the dataframe for a given window using the powerful high-level dataframe API, kind of like map on RDDs. Any idea if something like this could be supported in the future?

Thanks,

-------- Original Message --------
Subject: Re: map/foreachRDD equivalent for pyspark Structured Streaming
Local Time: 3 May 2017 12:05 PM
UTC Time: 3 May 2017 10:05
From: tathagata.das1...@gmail.com
To: peay <p...@protonmail.com>, user@spark.apache.org <user@spark.apache.org>

You can apply any kind of aggregation on windows. There are built-in aggregations (e.g. sum and count), and there is also an API for user-defined aggregations (Scala/Java) that works with both batch and streaming DFs. See the programming guide if you haven't seen it already:

- Windowing: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
- UDAFs on typesafe Datasets (Scala/Java): http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.Aggregator
- UDAFs on generic DataFrames (Scala/Java): http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction

You can also register a UDAF defined in Scala under a name and then call that function by name in SQL: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html

I feel a combination of these should be sufficient for you. Hope this helps.

On Wed, May 3, 2017 at 1:51 AM, peay <p...@protonmail.com> wrote:

Hello,

I would like to get started with Structured Streaming using a simple window. I have some existing Spark code that takes a dataframe and outputs a dataframe. This includes various joins and operations that are not supported by structured streaming yet. I am looking to essentially map/apply this to the data for each window.

Is there any way to apply a function to the dataframe corresponding to each window? This would mean accumulating data until the watermark is reached, and then mapping over the full corresponding dataframe.

I am using pyspark. I've seen the foreach writer, but it seems to operate at the partition level rather than on a full "window dataframe", and it is not available for Python anyway.

Thanks!
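For reference, a minimal pyspark sketch of the built-in windowed aggregation approach pointed to above. The socket source and the interval values are only stand-ins; any source with an event-time column would do:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windowed-counts").getOrCreate()

# Stand-in streaming source; a real job would read from Kafka, files, etc.
# current_timestamp() fakes an event-time column for the example.
events = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()
          .withColumn("timestamp", F.current_timestamp()))

# Built-in aggregation (count) over 5-minute event-time windows,
# with a 10-minute watermark to bound the kept state.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("append")  # a window is emitted once the watermark passes it
         .format("console")
         .start())
query.awaitTermination()
```

This covers built-in aggregations per window, though not the general dataframe-to-dataframe transform the question asks about.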