You can use the following method:

https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream%28java.lang.String,%20scala.Function1,%20boolean,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag%29

It has a parameter, newFilesOnly: "Should process only new files and ignore existing files in the directory." Set it to false and files that were already sitting in the directory when the job starts are picked up as well. It works as expected.
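For example, something along these lines (a rough sketch, not a tested job; the directory path, app name, and batch interval are placeholders, and LongWritable/Text/TextInputFormat are the usual types for plain text input, which is what textFileStream uses under the hood):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ExistingFilesExample {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(
      new SparkConf().setAppName("ExistingFilesExample"), Seconds(30))

    // Use fileStream instead of textFileStream so that newFilesOnly can be set.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///path/to/input",  // placeholder input directory
      (path: Path) => true,     // accept every file
      false                     // newFilesOnly = false: also process existing files
    ).map(_._2.toString)

    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}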
--
Emre Sevinç

On Fri, Jan 30, 2015 at 7:07 PM, ganterm <gant...@gmail.com> wrote:

> We are running a Spark streaming job that retrieves files from a directory
> (using textFileStream).
> One concern we are having is the case where the job is down but files are
> still being added to the directory.
> Once the job starts up again, those files are not being picked up (since
> they are not new or changed while the job is running), but we would like
> them to be processed.
> Is there a solution for that? Is there a way to keep track of what files
> have been processed, and can we "force" older files to be picked up? Is
> there a way to delete the processed files?
>
> Thanks!
> Markus
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-streaming-tracking-deleting-processed-files-tp21444.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>

--
Emre Sevinc