Hi all, any input explaining the rationale behind the question below would really help. Requesting some expert opinion.
Regards,
Sheel

On Sat, 15 Aug, 2020, 1:47 PM Sheel Pancholi, <sheelst...@gmail.com> wrote:

> Hello,
>
> I am trying to figure out an appropriate checkpoint interval for my Spark
> Streaming application. It is a Spark-Kafka integration based on Direct Streams.
>
> If my *micro batch interval is 2 mins*, and let's say *each micro batch
> takes only 15 secs to process*, then shouldn't my checkpoint interval also
> be exactly 2 mins?
>
> Assuming my Spark Streaming application starts at t=0, the following will be
> the state of my checkpoints:
>
> *Case 1: checkpoint interval is less than the micro batch interval*
> If I keep my *checkpoint interval at, say, 1 minute*, then:
> *t=1m:* no incomplete batches in this checkpoint
> *t=2m:* the first micro batch is included as an incomplete micro batch in the
> checkpoint, and its execution then begins
> *t=3m:* no incomplete batches in the checkpoint, as the first micro batch
> finished processing in just 15 secs
> *t=4m:* the second micro batch is included as an incomplete micro batch in the
> checkpoint, and its execution then begins
> *t=4m30s:* the system breaks down; on restarting, the streaming application
> finds the checkpoint at t=4m with the second micro batch marked as incomplete
> and processes it again. But what is the point of reprocessing it, since the
> second micro batch's processing was already completed at t=4m15s?
>
> *Case 2: checkpoint interval is more than the micro batch interval*
> If I keep my *checkpoint interval at, say, 4 minutes*, then:
> *t=2m:* first micro batch execution begins
> *t=4m:* first checkpoint, with the second micro batch included as the only
> incomplete batch; second micro batch processing begins
>
> *Sub case 1: system breaks down at t=2m30s:* the first micro batch
> execution was completed at t=2m15s, but there is no checkpoint information
> about this micro batch since the first checkpoint will only happen at t=4m.
> Consequently, when the streaming app restarts, it will re-execute it by
> fetching the offsets from Kafka.
>
> *Sub case 2: system breaks down at t=5m:* the second micro batch was
> already completed in 15 secs, i.e. at t=4m15s, which means at t=5m there
> should ideally be no incomplete batches. When I restart my application, the
> streaming application finds the second micro batch marked as incomplete in
> the checkpoint made at t=4m, and re-executes that micro batch.
>
> Is my understanding right? If yes, then isn't my checkpoint interval
> incorrectly set, resulting in duplicate processing in both of the cases
> above? If so, how do I choose an appropriate checkpoint interval?
>
> Regards,
> Sheel
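
For concreteness, below is a minimal sketch of the kind of setup being described, using the spark-streaming-kafka-0-10 Direct Stream API. The bootstrap servers, topic, group id and checkpoint directory are placeholder values, and the DStream-level checkpoint(...) call at the end is where the checkpoint interval in question would be set.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Minutes, StreamingContext}
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object CheckpointIntervalSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("checkpoint-interval-sketch")

    // 2-minute micro batch interval, as in the scenario above
    val ssc = new StreamingContext(conf, Minutes(2))

    // Directory where checkpoint data is written (placeholder path)
    ssc.checkpoint("hdfs:///tmp/checkpoints")

    // Placeholder Kafka connection settings
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "streaming-app",
      "auto.offset.reset" -> "latest",
      "enable.auto.commit" -> (false: java.lang.Boolean)
    )

    // Direct Stream over a placeholder topic
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      PreferConsistent,
      Subscribe[String, String](Seq("events"), kafkaParams)
    )

    val values = stream.map(_.value())

    // Checkpoint interval for this DStream; whether this should equal the
    // 2-minute batch interval is exactly the question above
    values.checkpoint(Minutes(2))

    values.foreachRDD { rdd =>
      // side-effecting output would go here
      println(s"batch size: ${rdd.count()}")
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

In this sketch the batch interval is fixed by the Duration passed to the StreamingContext constructor, ssc.checkpoint(...) only sets the checkpoint directory, and the Duration passed to checkpoint(...) on the DStream is the interval the question is about.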