I'm talking about the variant with inverseReduce. For example, if my batch duration is 1s and window duration is 10s, when I start the streaming job I'd want to start with a complete window instead of empty window, given I already have the RDDs for the batches that are missing at startup. After 1s, I'd want the oldest batch to be dropped off, and the inverse reduce being applied to all RDDs as usual.
On Sat, Jul 11, 2015 at 6:50 AM, Tathagata Das <[email protected]> wrote: > Are you talking about reduceByKeyAndWindow with or without inverse reduce? > > TD > > On Fri, Jul 10, 2015 at 2:07 AM, Imran Alam <[email protected]> wrote: > >> We have a streaming job that makes use of reduceByKeyAndWindow >> <https://github.com/apache/spark/blob/v1.4.0/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L334-L341>. >> We want this to work with an initial state. The idea is to avoid losing >> state if the streaming job is restarted, also to take historical data into >> account for the windows. But reduceByKeyAndWindow doesn't accept any >> "initialRDD" parameter like updateStateByKey >> <https://github.com/apache/spark/blob/v1.4.0/streaming/src/main/scala/org/apache/spark/streaming/dstream/PairDStreamFunctions.scala#L435-L445> >> does. >> >> The plan is to extend reduceByKeyAndWindow to accept an "initalRDDs" >> parameter, so that the DStream starts with those RDDs as initial value of >> "generatedRDD" rather than an empty map. But the "generatedRDD" is a >> private variable, so I'm bit confused on how to proceed with the plan. >> >> >
