+1 for this feature. In our use case we probably wouldn’t use it in production, but it can be useful during prototyping and algorithm development to repeatedly perform the same streaming operation on a fixed, already existing set of files.
-------------------------
jeremyfreeman.net
@thefreemanlab

On Apr 8, 2015, at 2:51 PM, Emre Sevinc <emre.sev...@gmail.com> wrote:

> Tathagata,
>
> Thanks for stating your preference for Approach 2.
>
> My use case and motivation are similar to the concerns raised by others in
> SPARK-3276. In previous versions of Spark, e.g. the 1.1.x series, Spark
> Streaming applications were able to process the files that already existed
> in an input directory before the streaming application began, and for some
> projects that we did for our customers, we relied on that feature. Starting
> from the 1.2.x series, we are limited in this respect to files whose
> timestamp is not older than 1 minute. The only workaround is to 'touch'
> those files before starting a streaming application.
>
> Moreover, MIN_REMEMBER_DURATION is set to an arbitrary value of 1 minute,
> and I don't see any argument why it cannot be set to another arbitrary
> value (keeping the default of 1 minute if nothing is set by the user).
>
> Putting all this together, my plan is to create a pull request that will:
>
> 1. Convert "private val MIN_REMEMBER_DURATION" into "private val
>    minRememberDuration" (to reflect that it is no longer a constant, since
>    it can be set via configuration).
>
> 2. Set its value using something like
>    getConf("spark.streaming.minRememberDuration", Minutes(1)).
>
> 3. Document spark.streaming.minRememberDuration in the Spark Streaming
>    Programming Guide.
>
> If the above sounds fine, I'll go ahead, implement this small change, and
> submit a pull request to fix SPARK-3276.
>
> What do you say?
>
> Kind regards,
>
> Emre Sevinç
> http://www.bigindustries.be/
>
>
> On Wed, Apr 8, 2015 at 7:16 PM, Tathagata Das <t...@databricks.com> wrote:
>
>> Approach 2 is definitely better :)
>> Can you tell us more about the use case why you want to do this?
>>
>> TD
>>
>> On Wed, Apr 8, 2015 at 1:44 AM, Emre Sevinc <emre.sev...@gmail.com> wrote:
>>
>>> Hello,
>>>
>>> This is about SPARK-3276: I want to make MIN_REMEMBER_DURATION (currently
>>> a constant) a configurable variable with a default value. Before spending
>>> effort on developing something and creating a pull request, I wanted to
>>> consult the core developers to see which approach makes the most sense
>>> and has the highest probability of being accepted.
>>>
>>> The constant MIN_REMEMBER_DURATION can be seen at:
>>>
>>> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala#L338
>>>
>>> It is marked as a private member of the private[streaming] object
>>> FileInputDStream.
>>>
>>> Approach 1: Make MIN_REMEMBER_DURATION a variable, with the new name
>>> minRememberDuration, and add a new fileStream method to
>>> JavaStreamingContext.scala:
>>>
>>> https://github.com/apache/spark/blob/master/streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
>>>
>>> such that the new fileStream method accepts a new parameter, e.g.
>>> minRememberDuration: Int (in seconds), and uses this value to set the
>>> private minRememberDuration.
>>>
>>> Approach 2: Create a new, public Spark configuration property, e.g. named
>>> spark.rememberDuration.min (with a default value of 60 seconds), and set
>>> the private variable minRememberDuration to the value of this Spark
>>> property.
>>>
>>> Approach 1 would mean adding a new method to the public API; Approach 2
>>> would mean creating a new public Spark property. Right now, Approach 2
>>> seems more straightforward and simpler to me, but nevertheless I wanted
>>> to have the opinions of other developers who know the internals of Spark
>>> better than I do.
>>>
>>> Kind regards,
>>> Emre Sevinç
>>
>
> --
> Emre Sevinc
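For context, the 1-minute limitation discussed in the thread comes from how FileInputDStream decides whether a file is "new": files whose modification time falls outside the remembered window are ignored, and MIN_REMEMBER_DURATION puts a floor on that window. A minimal sketch of that selection rule, with illustrative names and a simplified signature rather than the actual Spark internals:

```scala
// Illustrative sketch (not Spark source): a file is picked up only if its
// modification time is no older than the remembered window ending at the
// current batch time. MIN_REMEMBER_DURATION floors that window at 1 minute.
object FileSelectionSketch {
  val MinRememberDurationMs: Long = 60 * 1000L // the hard-coded 1 minute

  def isEligible(fileModTimeMs: Long, batchTimeMs: Long): Boolean =
    fileModTimeMs >= batchTimeMs - MinRememberDurationMs

  def main(args: Array[String]): Unit = {
    val now = System.currentTimeMillis()
    println(isEligible(now - 30 * 1000L, now))  // true: file is 30 seconds old
    println(isEligible(now - 120 * 1000L, now)) // false: pre-existing file is skipped
  }
}
```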
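And a minimal sketch of what the Approach 2 change could look like: reading the proposed property from the configuration, with the old 1-minute behavior as the default when nothing is set. The object and method names here are hypothetical; only SparkConf and the spark.streaming.minRememberDuration property name come from the thread:

```scala
import org.apache.spark.SparkConf

// Hypothetical sketch of Approach 2: replace the hard-coded constant with a
// value read from the Spark configuration, defaulting to the old 1 minute.
object MinRememberDurationSketch {
  private val DefaultSeconds = 60L // matches the old MIN_REMEMBER_DURATION = Minutes(1)

  def minRememberDurationSeconds(conf: SparkConf): Long =
    conf.getLong("spark.streaming.minRememberDuration", DefaultSeconds)

  def main(args: Array[String]): Unit = {
    // A user who wants pre-existing files processed would simply raise the value:
    val conf = new SparkConf(loadDefaults = false)
      .set("spark.streaming.minRememberDuration", "3600") // remember files up to 1 hour old
    println(minRememberDurationSeconds(conf)) // 3600
  }
}
```

This shape is also why Approach 2 is the preferred one in the thread: a configuration property leaves the public fileStream API surface untouched, whereas Approach 1 would add a new method signature that has to be maintained forever.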