It was added, but it's not documented publicly. I am planning to rename the conf to spark.streaming.fileStream.minRememberDuration to make it easier to understand.
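
To illustrate why older files get skipped: as described in the quoted reply below, the file stream only picks up files that fall inside the remember window (60 seconds by default). Here is a toy model of that check in plain Java. It is a rough sketch, not the actual Spark code; the class and method names are made up for illustration, and the exact comparison Spark uses may differ.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Simplified, hypothetical model of a file-stream "remember window". Not Spark's implementation. */
final class RememberWindowSketch {

    /**
     * Returns the names of files whose modification time lies inside the
     * window of length rememberMs that ends at nowMs.
     */
    static List<String> selectFiles(Map<String, Long> modTimesMs, long nowMs, long rememberMs) {
        // Assumption: anything modified at or before this threshold is ignored.
        long threshold = nowMs - rememberMs;
        List<String> selected = new ArrayList<>();
        for (Map.Entry<String, Long> entry : modTimesMs.entrySet()) {
            if (entry.getValue() > threshold) {
                selected.add(entry.getKey());
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        long now = 10 * 60_000L; // pretend "now" is 10 minutes after time zero

        // LinkedHashMap keeps insertion order, so the output order is stable.
        Map<String, Long> files = new LinkedHashMap<>();
        files.put("fresh.txt", now - 30_000L);     // modified 30 seconds ago
        files.put("old.txt", now - 5 * 60_000L);   // modified 5 minutes ago

        // With a 60 second window only the fresh file is selected.
        System.out.println(selectFiles(files, now, 60_000L));      // [fresh.txt]

        // With a 10 minute window both files are selected.
        System.out.println(selectFiles(files, now, 10 * 60_000L)); // [fresh.txt, old.txt]
    }
}
```

One more thing worth checking: SparkContext.getConf() returns a copy of the configuration, so calling set() on it after the context exists will most likely have no effect. Setting the value on the SparkConf before creating the streaming context is the safer route.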

On Mon, Jul 13, 2015 at 9:43 PM, Terry Hole <hujie.ea...@gmail.com> wrote:

> A new configuration named *spark.streaming.minRememberDuration* was added
> since 1.2.1 to control the file stream input, the default value is *60
> seconds*, you can change this value to a large value to include older
> files (older than 1 minute)
>
> You can get the detail from this jira:
> https://issues.apache.org/jira/browse/SPARK-3276
>
> -Terry
>
> On Tue, Jul 14, 2015 at 4:44 AM, automaticgiant <
> hunter.mor...@rackspace.com> wrote:
>
>> It's not as odd as it sounds. I want to ensure that long streaming job
>> outages can recover all the files that went into a directory while the job
>> was down.
>> I've looked at
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Generating-a-DStream-by-existing-textfiles-td20030.html#a20039
>> and
>>
>> http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-td14306.html#a16435
>> and
>>
>> https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469#29036469?newreg=e7e25469132d4fbc8350be8f876cf81e
>> , but all seem unhelpful.
>> I've tested combinations of the following:
>> * fileStreams created with dumb accept-all filters
>> * newFilesOnly true and false,
>> * tweaking minRememberDuration to high and low values,
>> * on hdfs or local directory.
>> The problem is that it will not read files in the directory from more
>> than a minute ago.
>> JavaPairInputDStream<LongWritable, Text> input = context.fileStream(indir,
>>     LongWritable.class, Text.class, TextInputFormat.class, v -> true, false);
>> Also tried with having set:
>>
>> context.sparkContext().getConf().set("spark.streaming.minRememberDuration",
>>     "1654564"); to big/small.
>>
>> Are there known limitations of the onlyNewFiles=false? Am I doing
>> something wrong?