Hi Team,

While running Spark Streaming, below are some findings.

   1. FileStreamSourceLog is responsible for maintaining the list of input
   source files.
   2. Spark Streaming deletes expired log files based on
   *spark.sql.streaming.fileSource.log.deletion* and
   *spark.sql.streaming.minBatchesToRetain* (see the config sketch after
   this list).
   3. But while compacting logs, Spark Streaming writes the complete list of
   files the stream has seen so far into one single .compact file in HDFS.
   4. Over time this compact file grows to around 2 GB-5 GB in HDFS, which
   delays creating the compact file at every 10th batch and also increases
   job restart time.
   5. Why does Spark Streaming keep logging files that have already been
   deleted from the system? While creating the compact file there should be
   some configurable timeout so that Spark can skip writing the expired list
   of input files.
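
For reference, here is a minimal sketch of the metadata-log settings I am
aware of, assuming the internal SQL configs
*spark.sql.streaming.fileSource.log.deletion*,
*spark.sql.streaming.fileSource.log.compactInterval*,
*spark.sql.streaming.fileSource.log.cleanupDelay*, and
*spark.sql.streaming.minBatchesToRetain*; the values shown are the defaults
as I understand them:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("file-stream-metadata-log-demo")
      // Delete expired entries from the file-source metadata log.
      .config("spark.sql.streaming.fileSource.log.deletion", "true")
      // Compact the metadata log every N batches; this is why the
      // .compact file is rewritten at every 10th batch.
      .config("spark.sql.streaming.fileSource.log.compactInterval", "10")
      // How long to keep expired log files before deleting them.
      .config("spark.sql.streaming.fileSource.log.cleanupDelay", "10m")
      // Minimum number of batches whose metadata is retained.
      .config("spark.sql.streaming.minBatchesToRetain", "100")
      .getOrCreate()

As far as I can tell, none of these caps what the compaction itself keeps:
the .compact file still carries every file ever seen, which is the
behaviour in question.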

Also, kindly let me know if I missed something and there is some
configuration already present to handle this.

Regards
Pappu Yadav
