Hi all,
I recently started evaluating Structured Streaming and ran into a snag right at the beginning.
Basically I wanted to read some data, do a basic aggregation, and write the
result to files:
import org.apache.spark.sql.functions.avg
import org.apache.spark.sql.streaming.ProcessingTime

val rawRecords = spark.readStream.schema(myschema).parquet("/mytest")
val q = rawRecords.withColumn("g", $"id" % 100).groupBy("g").agg(avg($"id"))
val res = q.writeStream
  .outputMode("complete")
  .trigger(ProcessingTime("10 seconds"))
  .format("parquet")
  .option("path", "/test2")
  .option("checkpointLocation", "/mycheckpoint")
  .start()
The problem is that Spark tells me the parquet sink does not support complete
output mode (or update, for that matter).
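Just to show what I mean by the update part, the same query with only the
output mode switched is rejected the same way:

val res2 = q.writeStream
  .outputMode("update")                     // only this line differs from the query above
  .trigger(ProcessingTime("10 seconds"))
  .format("parquet")
  .option("path", "/test2")
  .option("checkpointLocation", "/mycheckpoint")
  .start()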
So how would I write a streaming aggregation to files?
In general, my goal is to have a single (slow) streaming process that writes
some profile data, and a second streaming process that loads the current
profile as a dataframe and uses it in a join (I would stop the second process
and reload the dataframe periodically).
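To make that concrete, here is roughly what I have in mind (the names
otherSchema, /other, /joined and /mycheckpoint2 are just placeholders, and the
periodic reload is exactly the part I am unsure how to do):

// process 1: the slow aggregation query above, maintaining the profile under /test2

// process 2: join a second stream against the latest profile snapshot
val profile = spark.read.schema(myschema).parquet("/test2")              // static snapshot of the profile
val otherStream = spark.readStream.schema(otherSchema).parquet("/other") // placeholder second source
val q2 = otherStream.join(profile, Seq("g"))                             // stream-static join on the group key
  .writeStream
  .format("parquet")
  .option("path", "/joined")
  .option("checkpointLocation", "/mycheckpoint2")
  .start()

// periodically: q2.stop(), re-read /test2 into profile, and start a new query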
Any help would be appreciated.
Thanks,
Assaf.