How to introduce reset logic when aggregating/joining streaming dataframe with static dataframe for spark streaming

2020-07-23 Thread Yong Yuan
A good feature of spark structured streaming is that it can join the static dataframe with the streaming dataframe. To cite an example as below. users is a static dataframe read from database. transactionStream is from a stream. By the joining operation, we can get the spending of each country accu

How to optimize the configuration and/or code to solve the cache overloading issue?

2020-07-22 Thread Yong Yuan
I met a trouble in using spark structured streaming. The usercache is continuously consumed due to the join operation without releasing. How can I optimize the configuration and/or code to solve this problem? Spark Cluster in AWS EMR. 1 master node, m4.xlarge, 4 core, 16GB 2 core nodes, m4.xlarg

Spark Structured Streaming keep on consuming usercache

2020-07-20 Thread Yong Yuan
It seems the following structured streaming code keeps on consuming usercache until all disk space are occupied. val monitoring_stream = monitoring_df.writeStream .trigger(Trigger.ProcessingTime("120 seconds")) .foreachBatch { (batchDF: DataFrame, b

Are there some pitfalls in my spark structured streaming code which causes slow response after several hours running?

2020-07-18 Thread Yong Yuan
The spark job has the correct functions and logic. However, after several hours running, it becomes slower and slower. Are there some pitfalls in the below code? Thanks! val query = "(select * from meta_table) as meta_data" val meta_schema = new StructType() .add("config_id", BooleanType)