reduceByKeyAndWindow with an inverse function has a unique characteristic: it reuses a lot of the intermediate partitioned, partially-reduced data. For every new batch, it "reduces" the new data and "inverse reduces" the old, out-of-window data. That inverse-reduce needs data from an old batch. Under normal operation, that old batch is already partitioned and cached, so only the new data has to be partitioned. That is not the case at recovery time: the to-be-inverse-reduced old batch has to be re-read from Kafka and repartitioned again. Hence the two repartitions.
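For reference, this is the shape of the code in question. A minimal sketch of the pattern (the socket source, checkpoint path, and word counts here are placeholders standing in for your Kafka pipeline):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

// Placeholder job: batch interval matches the 15-minute slide interval.
val conf = new SparkConf().setAppName("WindowedCounts")
val ssc = new StreamingContext(conf, Minutes(15))
ssc.checkpoint("hdfs:///tmp/checkpoint")  // required by the inverse form; path is a placeholder

val words = ssc.socketTextStream("localhost", 9999)  // placeholder for the Kafka stream
  .flatMap(_.split(" "))
  .map(word => (word, 1L))

val counts = words.reduceByKeyAndWindow(
  (a: Long, b: Long) => a + b,  // "reduce": fold in the new batch
  (a: Long, b: Long) => a - b,  // "inverse reduce": subtract the batch that
                                // just fell out of the 2-hour window
  Minutes(120),                 // window duration
  Minutes(15)                   // slide duration
)
counts.print()
ssc.start()
ssc.awaitTermination()

The inverse-reduce lambda is exactly why the 14:15 batch is needed when the 16:15 batch runs: its partially-reduced contribution must be subtracted from the running window.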
On Sun, Nov 15, 2015 at 9:05 AM, kundan kumar <[email protected]> wrote:
> Hi,
>
> I am using the Spark Streaming checkpointing mechanism and reading data
> from Kafka. The window duration for my application is 2 hours with a
> sliding interval of 15 minutes.
>
> So, my batches run at the following intervals:
>
> - 09:45
> - 10:00
> - 10:15
> - 10:30
> - and so on
>
> When my job is restarted and recovers from the checkpoint, it does the
> re-partitioning step twice for each 15-minute job until the 2-hour window
> is complete. After that, the re-partitioning takes place only once.
>
> For example, when the job recovers at 16:15 it does re-partitioning for
> the 16:15 Kafka stream and the 14:15 Kafka stream as well. All the other
> intermediate stages are also computed for the 16:15 batch. I am using
> reduceByKeyAndWindow with an inverse function. Once the 2-hour window is
> complete (18:15 onward), re-partitioning takes place only once. It seems
> the checkpoint does not have RDDs stored beyond my 2-hour window
> duration. Because of this my job takes more time than usual.
>
> Is there a way, or some configuration parameter, that would help avoid
> repartitioning twice?
>
> Attaching the snaps where repartitioning takes place twice after recovery
> from the checkpoint.
>
> Thanks !!
>
> Kundan
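For completeness, recovery in a setup like this would normally go through StreamingContext.getOrCreate. A hedged sketch (the checkpoint path is a placeholder, and createContext stands in for building the graph above):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}

val checkpointDir = "hdfs:///tmp/checkpoint"  // placeholder path

// getOrCreate only calls this factory when no checkpoint exists. On restart
// the whole DStream graph (including the windowed reduction) is rebuilt
// from the checkpoint, which is when the old in-window batches have to be
// re-read from Kafka and repartitioned, as described above.
def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("WindowedCounts")
  val ssc = new StreamingContext(conf, Minutes(15))
  ssc.checkpoint(checkpointDir)
  // ... define the Kafka input stream and reduceByKeyAndWindow here
  ssc
}

val ssc = StreamingContext.getOrCreate(checkpointDir, () => createContext())
ssc.start()
ssc.awaitTermination()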
