Hi Luis,

The parameters “spark.cleaner.ttl” and “spark.streaming.unpersist” can both be used 
to remove stale streaming data that has timed out. The difference is that 
“spark.cleaner.ttl” is a time-based cleaner: it cleans not only streaming 
input data but also Spark’s stale metadata. 
“spark.streaming.unpersist”, on the other hand, is a reference-based cleaning mechanism: streaming 
data is removed once it falls out of the slide duration.
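For example, both can be set on the SparkConf before creating the StreamingContext. This is just a sketch; the TTL (3600 seconds) and batch interval (10 seconds) are placeholder values, not recommendations:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-cleanup-example")
  // time-based cleaner: periodically drops metadata and persisted data older than the TTL
  .set("spark.cleaner.ttl", "3600")
  // reference-based cleaning: unpersist input data once it falls out of the slide duration
  .set("spark.streaming.unpersist", "true")

val ssc = new StreamingContext(conf, Seconds(10))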

Both parameters can reduce the memory footprint of Spark 
Streaming. But if data floods into Spark Streaming at startup, as in 
your situation with Kafka, these two parameters cannot fully mitigate the 
problem. What you really need is to throttle the input so data is not injected so 
fast; you can try “spark.streaming.receiver.maxRate” to control the ingestion rate.
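For example, a sketch with a placeholder cap of 10000 records per second per receiver (a value you would tune against your batch processing time):

import org.apache.spark.SparkConf

val throttledConf = new SparkConf()
  .setAppName("streaming-rate-limit-example")
  // cap each receiver's ingestion rate; 10000 records/sec is only a placeholder
  .set("spark.streaming.receiver.maxRate", "10000")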

Thanks
Jerry

From: Luis Ángel Vicente Sánchez [mailto:langel.gro...@gmail.com]
Sent: Wednesday, September 10, 2014 5:21 AM
To: user@spark.apache.org
Subject: spark.cleaner.ttl and spark.streaming.unpersist

The executors of my spark streaming application are being killed due to memory 
issues. The memory consumption is quite high on startup because it is the first 
run and there are quite a few events on the Kafka queues, which are consumed at a 
rate of 100K events per second.

I wonder if it's recommended to use spark.cleaner.ttl and 
spark.streaming.unpersist together to mitigate that problem. I also wonder 
whether new RDDs are being batched while an RDD is being processed.
Regards,

Luis
