Hello,

So, I assume there is nothing in Structured Streaming to apply/transform based on a function that takes a dataframe as input and outputs a dataframe?
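For concreteness, here is a minimal sketch of the kind of dataframe-in, dataframe-out function I have in mind (the names and the logic are made up; the self-join stands in for the operations that streaming dataframes do not support yet):

```python
from pyspark.sql import functions as F

def my_transform(df):
    """Plain batch logic: takes a dataframe, returns a dataframe."""
    # Join each row against a per-key aggregate of the same data, which is
    # exactly the kind of operation not supported on streaming dataframes.
    totals = df.groupBy("key").agg(F.sum("value").alias("total"))
    return (df.join(totals, on="key")
              .withColumn("share", F.col("value") / F.col("total")))
```

What I am after is a way to run something like my_transform over the full dataframe accumulated for each window, once the watermark closes it.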
UDAFs are kind of low level: AFAIK, they require you to implement merge and to process individual rows (and they are not available in pyspark). Instead, I would like to directly transform the dataframe for a given window using the powerful high-level dataframe API, kind of like map on RDDs. Any idea if something like this could be supported in the future?

Thanks,

-------- Original Message --------
Subject: Re: map/foreachRDD equivalent for pyspark Structured Streaming
Local Time: 3 May 2017 12:05 PM
UTC Time: 3 May 2017 10:05
From: tathagata.das1...@gmail.com
To: peay <p...@protonmail.com>, user@spark.apache.org <user@spark.apache.org>

You can apply any kind of aggregation on windows. There are built-in aggregations (e.g. sum and count), and there is also an API for user-defined aggregations (Scala/Java) that works with both batch and streaming DFs. See the programming guide if you haven't seen it already:

- Windowing: http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
- UDAFs on typesafe Datasets (Scala/Java): http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.Aggregator
- UDAFs on generic DataFrames (Scala/Java): http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction

You can also register a UDAF defined in Scala under a name and then call that function by name in SQL: https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html

I feel a combination of these should be sufficient for you. Hope this helps.

On Wed, May 3, 2017 at 1:51 AM, peay <p...@protonmail.com> wrote:

Hello,

I would like to get started with Structured Streaming using a simple window. I have some existing Spark code that takes a dataframe and outputs a dataframe. This includes various joins and operations that are not supported by structured streaming yet. I am looking to essentially map/apply this to the data for each window.

Is there any way to apply a function to the dataframe corresponding to each window? This would mean accumulating data until the watermark is reached, and then mapping over the full corresponding dataframe.

I am using pyspark. I've seen the foreach writer, but it seems to operate at the partition level rather than on a full "window dataframe", and it is not available for Python anyway.

Thanks!
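For reference, a minimal pyspark sketch of the built-in windowed aggregation approach pointed to above. The socket source and the interval values are only stand-ins; any source with an event-time column would do:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("windowed-counts").getOrCreate()

# Stand-in streaming source; a real job would read from Kafka, files, etc.
# current_timestamp() fakes an event-time column for the example.
events = (spark.readStream
          .format("socket")
          .option("host", "localhost")
          .option("port", 9999)
          .load()
          .withColumn("timestamp", F.current_timestamp()))

# Built-in aggregation (count) over 5-minute event-time windows,
# with a 10-minute watermark to bound the kept state.
counts = (events
          .withWatermark("timestamp", "10 minutes")
          .groupBy(F.window("timestamp", "5 minutes"))
          .count())

query = (counts.writeStream
         .outputMode("append")  # a window is emitted once the watermark passes it
         .format("console")
         .start())
query.awaitTermination()
```

This covers built-in aggregations per window, though not the general dataframe-to-dataframe transform the question asks about.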