Hi TD, thanks for getting back on this. Yes, that's what I was experiencing: data checkpoints were being recovered from a point considerably earlier than the last data checkpoint, probably since the beginning of the first writes (I would have to confirm). I do have an update on this, though.
These results were seen when I ran the application from my Windows laptop, where I have IntelliJ, while the HDFS file system was on a linux box (with considerable latency!). I couldn't find any exceptions in the Spark logs, and I did see metadata checkpoints being recycled in the HDFS folder. Upon recovery I could see the usual Spark Streaming timestamp prints on the console jumping from one data checkpoint moment to the next very slowly.

Once I moved the app to the linux box where I had HDFS, this problem seemed to go away. If the issue only happens when running from Windows I won't be so concerned and can go back to testing everything on linux. My only remaining concern is whether, given the substantial HDFS latency from the Spark app, there is some kind of race condition between writes and cleanups of the HDFS checkpoint files that could have led to this behaviour.

Hope this description helps. Thanks again,
Rod
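
P.S. For reference, the recovery pattern I'm describing is roughly the standard getOrCreate one below. This is only a minimal sketch, not my actual job: the HDFS checkpoint path, batch interval, socket source and checkpoint interval are all placeholders.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointRecoverySketch {
  // Placeholder checkpoint directory on the remote HDFS cluster (not my real path).
  val checkpointDir = "hdfs://linuxbox:8020/user/rod/checkpoints"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpoint-recovery-sketch")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)      // metadata checkpoints written here

    val lines = ssc.socketTextStream("localhost", 9999)
    lines.checkpoint(Seconds(30))      // periodic data checkpoints of the RDDs
    lines.count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On restart, getOrCreate rebuilds the context from the metadata checkpoint and
    // replays from the last data checkpoint; this is where I see the console
    // timestamps crawling slowly from one checkpoint moment to the next.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}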