Hello!
I am attempting to write a streaming pipeline that would consume data from a
Kafka source, manipulate the data, and then write results to a downstream sink
(Kafka, Redis, etc). I want to write fully formed SQL instead of using the
function API that Spark offers. I read a few guides on how to do this and my
understanding is that I need to create a temp view in order to execute my raw
SQL queries via spark.sql().
However, I'm having trouble defining a watermark on my source. There doesn't seem
to be a way to declare a watermark in the raw SQL that Spark supports, so I'm
using the .withWatermark() function. However, this watermark does not take
effect on the temp view.
Example code:
```
# Parse the Kafka value as JSON and attach an event-time watermark
json_df = (
    streaming_df
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
    .select("parsed_value.*")
    .withWatermark("createTime", "10 seconds")
)
json_df.createOrReplaceTempView("json_df")
session.sql("""
SELECT
window.start, window.end, provinceId, sum(payAmount) as totalPayAmount
FROM json_df
GROUP BY provinceId, window(createTime, '1 hour', '30 minutes')
ORDER BY window.start
""")\
.writeStream\
.format("kafka") \
.option("checkpointLocation", "checkpoint") \
.option("kafka.bootstrap.servers", "localhost:9092") \
.option("topic", "sink") \
.start()
```
This throws
```
AnalysisException: Append output mode not supported when there are streaming
aggregations on streaming DataFrames/DataSets without watermark;
```
If I swap out the SQL query and express the same aggregation with the function
API instead, everything works fine.
How can I use spark.sql() in conjunction with watermarks?
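For reference, here is roughly the function-API version that runs without the exception. The schema fields and the source topic name ("source") are assumptions based on my query above, and I left out the ORDER BY since sorting isn't supported on streaming DataFrames in append mode anyway:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, sum as sum_, window
from pyspark.sql.types import (DoubleType, IntegerType, StructField,
                               StructType, TimestampType)

spark = SparkSession.builder.appName("watermark-demo").getOrCreate()

# Assumed schema for the Kafka JSON payload -- adjust to the real message format
schema = StructType([
    StructField("createTime", TimestampType()),
    StructField("provinceId", IntegerType()),
    StructField("payAmount", DoubleType()),
])

streaming_df = (
    spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "source")  # assumed source topic name
    .load()
)

# Same parsing and watermark as in the snippet above
json_df = (
    streaming_df
    .select(from_json(col("value").cast("string"), schema).alias("parsed_value"))
    .select("parsed_value.*")
    .withWatermark("createTime", "10 seconds")
)

# The same windowed aggregation, expressed with the function API:
# here the watermark is respected and append mode is accepted
result = (
    json_df
    .groupBy(window(col("createTime"), "1 hour", "30 minutes"),
             col("provinceId"))
    .agg(sum_("payAmount").alias("totalPayAmount"))
    .select("window.start", "window.end", "provinceId", "totalPayAmount")
)
```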
Best,
Chloe