Hi all,
I recently started evaluating Structured Streaming and ran into a snag right at the beginning.
Basically I wanted to read some data, do a basic aggregation, and write the
result to files:
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.streaming.ProcessingTime

val rawRecords = spark.readStream.schema(myschema).parquet("/mytest")
val q = rawRecords.withColumn("g", $"id" % 100).groupBy("g").agg(avg($"id"))
val res = q.writeStream
  .outputMode("complete")
  .trigger(ProcessingTime("10 seconds"))
  .format("parquet")
  .option("path", "/test2")
  .option("checkpointLocation", "/mycheckpoint")
  .start()
The problem is that Spark tells me the parquet sink does not support complete
output mode (or update, for that matter).
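Just to show what I mean by the update part, the same query with only the
output mode switched is rejected the same way:

val res2 = q.writeStream
  .outputMode("update")                     // only this line differs from the query above
  .trigger(ProcessingTime("10 seconds"))
  .format("parquet")
  .option("path", "/test2")
  .option("checkpointLocation", "/mycheckpoint")
  .start()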
So how would I write a streaming aggregation to files?
In general, my goal is to have a single (slow) streaming process that writes
some profile data, and a second streaming process that loads the current
profile as a dataframe and uses it in a join (I would stop the second process
and reload the dataframe periodically).
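To make that concrete, here is roughly what I have in mind (the names
otherSchema, /other, /joined and /mycheckpoint2 are just placeholders, and the
periodic reload is exactly the part I am unsure how to do):

// process 1: the slow aggregation query above, maintaining the profile under /test2

// process 2: join a second stream against the latest profile snapshot
val profile = spark.read.schema(myschema).parquet("/test2")              // static snapshot of the profile
val otherStream = spark.readStream.schema(otherSchema).parquet("/other") // placeholder second source
val q2 = otherStream.join(profile, Seq("g"))                             // stream-static join on the group key
  .writeStream
  .format("parquet")
  .option("path", "/joined")
  .option("checkpointLocation", "/mycheckpoint2")
  .start()

// periodically: q2.stop(), re-read /test2 into profile, and start a new query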
Any help would be appreciated.
Thanks,
Assaf.