Are Spark Structured Streaming checkpoint files expected to grow indefinitely
over time? Is there a recommended way to safely age off old checkpoint data?

Currently we have a Spark Structured Streaming process reading from Kafka and
writing to an HDFS sink, with checkpointing enabled and writing to a location
on HDFS. This streaming application has been running for 4 months, and over
time we have noticed that on every 10th job within the application there is
roughly a 5-minute delay between one job finishing and the next starting,
which we have attributed to the checkpoint compaction process. At this point
the .compact file that is written is about 2 GB in size, and its contents show
that it still tracks files processed at the very origin of the streaming
application.
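
For context, the pipeline is the usual Kafka-in, file-sink-out shape. Below is
a minimal sketch of that kind of query; the broker, topic, and HDFS paths are
placeholders rather than our actual configuration:

import org.apache.spark.sql.SparkSession

object KafkaToHdfsStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("kafka-to-hdfs")
      .getOrCreate()

    val input = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "events")                     // placeholder topic
      .load()

    val query = input
      .selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")                                          // HDFS file sink
      .option("path", "hdfs:///data/events")                      // placeholder output path
      .option("checkpointLocation", "hdfs:///checkpoints/events") // placeholder checkpoint dir
      .start()

    query.awaitTermination()
  }
}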

This issue can be reproduced with any Spark Structured Streaming process that 
writes checkpoint files.

Is the best approach for handling the growth of these files simply to delete
the latest .compact file within the checkpoint directory, and are there any
risks associated with doing so?
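
For reference, this is a rough sketch of how we inspect the .compact files
before deciding anything; the directory below is only a placeholder for
wherever the metadata log lives in a given setup:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object ListCompactFiles {
  def main(args: Array[String]): Unit = {
    // Placeholder: point this at whichever directory holds the .compact files.
    val logDir = new Path("hdfs:///checkpoints/events")
    val fs     = logDir.getFileSystem(new Configuration())

    // Compacted metadata log batches end in ".compact"; print each with its size.
    fs.listStatus(logDir)
      .map(_.getPath)
      .filter(_.getName.endsWith(".compact"))
      .foreach(p => println(s"$p  ${fs.getFileStatus(p).getLen} bytes"))
  }
}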

