A new configuration property named *spark.streaming.minRememberDuration* was
added in 1.2.1 to control file stream input. The default value is *60
seconds*; set it to a larger value to have the stream pick up older files
(older than 1 minute).
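
For example, here is a minimal sketch in Java (the directory path, app name,
batch interval, and the 30-day value are made up for illustration, and it
assumes a Spark version where this property is honored). One caveat I believe
applies: SparkContext.getConf() returns a copy of the conf, so calling set()
on it after the context exists has no effect; set the property on the
SparkConf before constructing the StreamingContext.

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class OldFileStream {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf()
        .setAppName("old-file-stream")  // hypothetical app name
        // Remember files up to ~30 days old; a bare number is read as seconds.
        .set("spark.streaming.minRememberDuration", "2592000");
    JavaStreamingContext context =
        new JavaStreamingContext(conf, new Duration(10000));  // 10s batches

    // newFilesOnly = false plus the large remember duration lets the first
    // batch pick up files already sitting in the directory.
    JavaPairInputDStream<LongWritable, Text> input = context.fileStream(
        "hdfs:///data/indir",  // hypothetical input directory
        LongWritable.class, Text.class, TextInputFormat.class,
        v -> true, false);

    input.print();
    context.start();
    context.awaitTermination();
  }
}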

You can get the details from this JIRA:
https://issues.apache.org/jira/browse/SPARK-3276

-Terry

On Tue, Jul 14, 2015 at 4:44 AM, automaticgiant <hunter.mor...@rackspace.com> wrote:

> It's not as odd as it sounds. I want to ensure that, after a long streaming
> job outage, the job can recover all the files that went into a directory
> while it was down.
> I've looked at
> http://apache-spark-user-list.1001560.n3.nabble.com/Generating-a-DStream-by-existing-textfiles-td20030.html#a20039
> and
> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-td14306.html#a16435
> and
> https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469#29036469?newreg=e7e25469132d4fbc8350be8f876cf81e
> but all seem unhelpful.
> I've tested combinations of the following:
>  * fileStreams created with dumb accept-all filters,
>  * newFilesOnly set to true and false,
>  * minRememberDuration tweaked to high and low values,
>  * on HDFS or a local directory.
> The problem is that it will not read files in the directory that are more
> than a minute old.
> JavaPairInputDStream<LongWritable, Text> input = context.fileStream(indir,
> LongWritable.class, Text.class, TextInputFormat.class, v -> true, false);
> I also tried setting
> context.sparkContext().getConf().set("spark.streaming.minRememberDuration",
> "1654564");
> to both large and small values.
>
> Are there known limitations of newFilesOnly=false? Am I doing something
> wrong?
