A good feature of Spark Structured Streaming is that it can join a static
dataframe with a streaming dataframe. To cite an example: users is a static
dataframe read from a database, and transactionStream comes from a stream.
Through the join operation, we can get the accumulated spending of each
country.
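A minimal sketch of such a stream-static join, assuming hypothetical column
names (userId, country, amount) and using Spark's built-in rate source in
place of the real transaction stream:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("StreamStaticJoinSketch").getOrCreate()
import spark.implicits._

// Static side: a small in-memory table stands in for the users table
// that the question reads from a database.
val users = Seq((0L, "US"), (1L, "DE"), (2L, "JP")).toDF("userId", "country")

// Streaming side: the rate source stands in for the real transaction stream.
val transactionStream = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .select((col("value") % 3).as("userId"), lit(1.0).as("amount"))

// Each micro-batch of the stream is joined against the static dataframe,
// then aggregated into the running spending total per country.
val spendingByCountry = transactionStream
  .join(users, Seq("userId"))
  .groupBy("country")
  .agg(sum("amount").as("total_spending"))

spendingByCountry.writeStream
  .outputMode("complete")
  .format("console")
  .start()
  .awaitTermination()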
I ran into trouble using Spark Structured Streaming: the usercache is
continuously consumed by the join operation and never released. How can I
optimize the configuration and/or code to solve this problem?
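One direction often suggested for this symptom (an assumption on my part, not
something stated above) is that the shuffle files written by each micro-batch
join sit under usercache and are only deleted once the driver garbage-collects
the references to them, so running the context cleaner more often can keep the
disk from filling. A minimal configuration sketch:

import org.apache.spark.sql.SparkSession

// Hypothetical session setup: spark.cleaner.periodicGC.interval defaults to
// 30min; shortening it triggers GC on the driver more often, which lets the
// ContextCleaner remove stale shuffle files from the worker disks.
val spark = SparkSession.builder()
  .appName("StreamingJoinWithCleanup")
  .config("spark.cleaner.periodicGC.interval", "5min")
  .getOrCreate()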
Spark cluster in AWS EMR:
1 master node, m4.xlarge, 4 cores, 16 GB
2 core nodes, m4.xlarge, 4 cores, 16 GB
It seems the following Structured Streaming code keeps consuming usercache
until all disk space is occupied.
val monitoring_stream = monitoring_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    // ... per-batch processing ...
  }
  .start()
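The body of foreachBatch is cut off above, but if it caches the batch
dataframe, one common pattern is to unpersist it at the end of every batch so
cached blocks do not pile up between triggers. A sketch under that assumption
(the show() call is only a placeholder for the real processing):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.streaming.Trigger

val monitoring_stream = monitoring_df.writeStream
  .trigger(Trigger.ProcessingTime("120 seconds"))
  .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
    batchDF.persist()    // cache once so the batch can be reused cheaply
    batchDF.show()       // placeholder for the real per-batch processing
    batchDF.unpersist()  // release the cached blocks before the next trigger
  }
  .start()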
The Spark job has the correct functions and logic. However, after several
hours of running, it becomes slower and slower. Are there any pitfalls in the
code below? Thanks!
import org.apache.spark.sql.types.{BooleanType, StructType}

val query = "(select * from meta_table) as meta_data"
val meta_schema = new StructType()
  .add("config_id", BooleanType)
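The snippet ends mid-definition, but a query string shaped like
"(select ...) as alias" is typically passed as the dbtable option of a JDBC
read. A minimal sketch of that pattern; the URL and credentials are
placeholders, not values from the post:

// Hypothetical JDBC read built around the query defined above.
val meta_df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://host:3306/db") // placeholder URL
  .option("dbtable", query)                   // the subquery defined above
  .option("user", "user")                     // placeholder credentials
  .option("password", "password")
  .load()

If a read like this runs inside every micro-batch, the repeated JDBC round
trips are one plausible cause of a job slowing down over hours, though only
the elided code could confirm that.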