It's not as odd as it sounds. I want to make sure that, after a long outage
of a streaming job, the restarted job can recover all the files that landed
in the directory while it was down.
I've looked at
http://apache-spark-user-list.1001560.n3.nabble.com/Generating-a-DStream-by-existing-textfiles-td20030.html#a20039
and
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-td14306.html#a16435
and
https://stackoverflow.com/questions/29022379/spark-streaming-hdfs/29036469#29036469?newreg=e7e25469132d4fbc8350be8f876cf81e
, but all seem unhelpful.
I've tested combinations of the following:
 * fileStreams created with dumb accept-all filters,
 * newFilesOnly true and false,
 * tweaking minRememberDuration to high and low values,
 * running against HDFS and against a local directory.
The problem is that the stream never picks up files that were put into the
directory more than about a minute before the job started.
// accept-all filter, newFilesOnly = false
JavaPairInputDStream<LongWritable, Text> input = context.fileStream(
        indir, LongWritable.class, Text.class, TextInputFormat.class,
        v -> true, false);
I also tried setting

context.sparkContext().getConf().set("spark.streaming.minRememberDuration", "1654564");

with both large and small values.

Are there known limitations of newFilesOnly=false? Am I doing something
wrong?


