You can use the following method:

https://spark.apache.org/docs/1.2.0/api/java/org/apache/spark/streaming/StreamingContext.html#fileStream%28java.lang.String,%20scala.Function1,%20boolean,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag,%20scala.reflect.ClassTag%29

It has a parameter:

  newFilesOnly - Should process only new files and ignore existing files
  in the directory

Set newFilesOnly to false and it works as expected: files that were already
in the directory when the job starts are processed as well.
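For example, something along these lines (an untested sketch; the directory
path and the filename filter are just placeholders, adjust them for your
setup):

import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ProcessExistingFiles {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("ProcessExistingFiles")
    val ssc = new StreamingContext(conf, Seconds(30))

    // newFilesOnly = false: also consider files that are already in the
    // directory when the context starts, not only files that show up later.
    val lines = ssc.fileStream[LongWritable, Text, TextInputFormat](
      "hdfs:///path/to/input",                        // placeholder directory
      (path: Path) => !path.getName.startsWith("."),  // skip hidden/temp files
      newFilesOnly = false
    ).map { case (_, text) => text.toString }

    lines.print()

    ssc.start()
    ssc.awaitTermination()
  }
}

If I remember correctly, textFileStream is essentially fileStream[LongWritable,
Text, TextInputFormat] with newFilesOnly left at true, which is why the files
added while your job was down are being skipped.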

--
Emre Sevinç



On Fri, Jan 30, 2015 at 7:07 PM, ganterm <gant...@gmail.com> wrote:

> We are running a Spark streaming job that retrieves files from a directory
> (using textFileStream).
> One concern we have is the case where the job is down but files are still
> being added to the directory.
> Once the job starts up again, those files are not picked up (since they
> were not new or changed while the job was running), but we would like them
> to be processed.
> Is there a solution for that? Is there a way to keep track of which files
> have been processed, and can we "force" older files to be picked up? Is
> there a way to delete the processed files?
>
> Thanks!
> Markus
>

