Hi Spark users! :)

I come to you with a question about checkpoints.
I have a streaming application that consumes from and produces to Kafka.
The computation requires a window and watermarking.
Since this is a streaming application with a Kafka output, a checkpoint
location is required.

The application runs via spark-submit on a single master and writes its
checkpoints to the local hard drive.
It runs fine until the checkpoint files in the "state" directory fill the
disk. This is not a space issue: the filesystem runs out of inodes (tens of
thousands of inodes are consumed).

I searched the docs and Stack Overflow.

I found these settings:
- spark.cleaner.referenceTracking.cleanCheckpoints
- spark.cleaner.periodicGC.interval
I set them both from the app and from the command line, without any success.
Am I misusing them? Is there another setting?
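
For reference, this is roughly how I set them; the app name, master and the
interval value are just illustrative:

  import org.apache.spark.sql.SparkSession

  // From the app (values are illustrative):
  val spark = SparkSession.builder()
    .appName("checkpoint-test")
    .config("spark.cleaner.referenceTracking.cleanCheckpoints", "true")
    .config("spark.cleaner.periodicGC.interval", "1min")
    .getOrCreate()

  // And the equivalent on the spark-submit command line:
  //   spark-submit \
  //     --conf spark.cleaner.referenceTracking.cleanCheckpoints=true \
  //     --conf spark.cleaner.periodicGC.interval=1min \
  //     ... <application jar>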

I can also see this kind of log output:
...
17/09/21 23:27:46 INFO spark.ContextCleaner: Cleaned accumulator 25
17/09/21 23:27:46 INFO spark.ContextCleaner: Cleaned accumulator 11
17/09/21 23:27:46 INFO spark.ContextCleaner: Cleaned shuffle 0
17/09/21 23:27:46 INFO spark.ContextCleaner: Cleaned accumulator 7
...

A sample that reproduces the issue:
The window, watermark and output trigger durations are all set to 10 seconds.
The Kafka topic is quite small (2 messages per second are added).

https://gist.github.com/anonymous/2e83db84d5190ed1ad7a7d2d5cd632f0
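
In case the gist becomes unavailable, the core of the sample looks roughly
like this (the broker address, topic names and checkpoint path are
placeholders, and `spark` is the session configured above):

  import org.apache.spark.sql.functions.{col, window}
  import org.apache.spark.sql.streaming.Trigger

  // Read from Kafka (placeholder broker and topic).
  val input = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("subscribe", "input-topic")
    .load()

  // 10-second window and watermark, as in the sample.
  val counts = input
    .selectExpr("CAST(value AS STRING) AS value", "timestamp")
    .withWatermark("timestamp", "10 seconds")
    .groupBy(window(col("timestamp"), "10 seconds"))
    .count()

  // Write back to Kafka with a 10-second trigger; the checkpoint goes to the
  // local disk, which is where the "state" directory fills up.
  val query = counts
    .selectExpr("CAST(window.start AS STRING) AS key",
                "CAST(count AS STRING) AS value")
    .writeStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "localhost:9092")
    .option("topic", "output-topic")
    .option("checkpointLocation", "/tmp/checkpoints")
    .outputMode("update")
    .trigger(Trigger.ProcessingTime("10 seconds"))
    .start()

  query.awaitTermination()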

Regards,



