Any inputs explaining the rationale behind the question below would really
help. Requesting some expert opinion.

Regards,
Sheel
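
To make the two timelines in the question concrete, here is a toy simulation of my understanding of metadata checkpointing. It is plain Python, not Spark — the `simulate` function and its timeline model are my own simplification of the recovery behaviour, not Spark's actual logic:

```python
def simulate(batch_interval, cp_interval, proc_time, t_fail):
    """Toy model of Spark Streaming metadata checkpointing (all times in secs).

    Assumptions (my simplification, not Spark internals):
      - batches start at t = batch_interval, 2*batch_interval, ...
      - each batch finishes proc_time seconds after it starts
      - a checkpoint taken at time t records every batch that has
        started but not yet finished at t
      - on failure at t_fail, recovery uses the last checkpoint at or
        before t_fail

    Returns (redundant, unrecorded):
      redundant  - batch start times the last checkpoint lists as
                   incomplete although they actually finished before
                   the failure (duplicate processing on restart)
      unrecorded - batches that finished but started after the last
                   checkpoint (recovered via Kafka offsets instead)
    """
    starts = range(batch_interval, t_fail + 1, batch_interval)
    last_cp = (t_fail // cp_interval) * cp_interval
    redundant = [s for s in starts
                 if s <= last_cp < s + proc_time   # incomplete at checkpoint
                 and s + proc_time <= t_fail]      # but finished before crash
    unrecorded = [s for s in starts
                  if s > last_cp and s + proc_time <= t_fail]
    return redundant, unrecorded
```

With my numbers (2-min batches, 15-sec processing), `simulate(120, 60, 15, 270)` reproduces Case 1 (the batch started at t=4m is re-executed although it finished at t=4m15s), and `simulate(120, 240, 15, 150)` and `simulate(120, 240, 15, 300)` reproduce the two sub-cases of Case 2.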

On Sat, 15 Aug, 2020, 1:47 PM Sheel Pancholi, <sheelst...@gmail.com> wrote:

> Hello,
>
> I am trying to figure out an appropriate checkpoint interval for my Spark
> Streaming application. It's a Spark-Kafka integration based on Direct Streams.
>
> If my *microbatch interval is 2 mins*, and let's say *each microbatch
> takes only 15 secs to process*, then shouldn't my checkpoint interval also
> be exactly 2 mins?
>
> Assuming my spark streaming application starts at t=0, following will be
> the state of my checkpoint:
>
> *Case 1: checkpoint interval is less than microbatch interval*
> If I keep my *checkpoint interval at say 1 minute *then:
> *t=1m: *no incomplete batches in this checkpoint
> *t=2m*: first microbatch is included as an incomplete microbatch in the
> checkpoint and microbatch execution then begins
> *t=3m*: no incomplete batches in the checkpoint as the first microbatch
> is finished processing in just 15 secs
> *t=4m*: second microbatch is included as an incomplete microbatch in the
> checkpoint and microbatch execution then begins
> *t=4m30s*: the system breaks down; on restarting, the streaming application
> finds the checkpoint at t=4m with the second microbatch marked incomplete
> and processes it again. But what's the point of reprocessing it, since the
> second microbatch's processing was already completed at t=4m15s?
>
> *Case 2: checkpoint interval is more than microbatch interval*
> If I keep my *checkpoint interval at say 4 minutes* then:
> *t=2m* first microbatch execution begins
> *t=4m* first checkpoint with second microbatch included as the only
> incomplete batch; second microbatch processing begins
>
> *Sub-case 1:* *system breaks down at t=2m30s:* the first microbatch
> execution was completed at t=2m15s, but there is no checkpoint information
> about this microbatch since the first checkpoint only happens at t=4m.
> Consequently, when the streaming app restarts it will re-execute the first
> microbatch by fetching the offsets from Kafka.
>
> *Sub-case 2:* *system breaks down at t=5m:* the second microbatch was
> already completed in 15 secs, i.e. at t=4m15s, which means at t=5m there
> should ideally be no incomplete batches. But when I restart my application,
> the streaming application finds the second microbatch marked incomplete in
> the checkpoint made at t=4m, and re-executes that microbatch.
>
>
> Is my understanding right? If yes, then isn't my checkpoint interval
> incorrectly set, resulting in duplicate processing in both cases above?
> And if so, how do I choose an appropriate checkpoint interval?
>
> Regards,
> Sheel
>
