Hi TD, thanks for getting back on this.

Yes, that's what I was experiencing: data checkpoints were being recovered
from considerably before the last data checkpoint, probably from the very
first writes (I would have to confirm that). I do have some new findings on
this, though.

This behaviour shows up when I run the application from my Windows laptop
(where I have IntelliJ), while the HDFS file system is on a Linux box (with
very reasonable latency!). I couldn't find any exceptions in the Spark logs,
and I did see metadata checkpoints being recycled in the HDFS folder.

Upon recovery I could see the usual Spark Streaming timestamp prints on the
console advancing very slowly from one data checkpoint moment to the next.
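
For context, the checkpointing in the app is wired up roughly along these
lines (a heavily simplified sketch; the checkpoint path, batch interval,
source and stateful operation below are placeholders rather than the real job):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    object CheckpointedApp {
      // Placeholder checkpoint location on the Linux HDFS box
      val checkpointDir = "hdfs://linuxbox:8020/user/rod/checkpoints"

      def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("CheckpointedApp")
        val ssc  = new StreamingContext(conf, Seconds(10))
        ssc.checkpoint(checkpointDir)  // metadata + data checkpoints go to HDFS

        val lines = ssc.socketTextStream("localhost", 9999)  // placeholder source
        // A stateful operation like this forces periodic RDD (data) checkpoints
        val counts = lines.map(w => (w, 1L)).updateStateByKey[Long] {
          (values: Seq[Long], state: Option[Long]) => Some(state.getOrElse(0L) + values.sum)
        }
        counts.print()
        ssc
      }

      def main(args: Array[String]): Unit = {
        // On restart, recover the context from the checkpoint dir if one exists
        val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }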

Once I moved the app to the Linux box where I have HDFS, this problem seemed
to go away. If this issue only happens when running from Windows I won't be
so concerned and can go back to testing everything on Linux.
My only concern is whether, given the substantial HDFS latency seen by the
Spark app in that setup, there could be some kind of race condition between
the writes and the cleanups of HDFS checkpoint files that led to this finding.
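
To check that race-condition idea, my plan is to dump what is actually left
in the checkpoint folder and when each entry was last written, along these
lines (just a rough sketch; the path is a placeholder for the real dir):

    import java.util.Date
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    object ListCheckpoints {
      def main(args: Array[String]): Unit = {
        // Placeholder: same dir the streaming app checkpoints to
        val dir = new Path("hdfs://linuxbox:8020/user/rod/checkpoints")
        val fs  = FileSystem.get(dir.toUri, new Configuration())

        // Print every entry under the checkpoint dir with its modification
        // time, oldest first, to see whether old data checkpoints linger or
        // recent ones were already cleaned up at recovery time
        fs.listStatus(dir)
          .sortBy(_.getModificationTime)
          .foreach(s => println(s"${new Date(s.getModificationTime)}  ${s.getPath}"))
      }
    }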

Hope this description helps.

Thanks again,
Rod

