If I call commit in foreachRDD at the end of a batch, is there still a
possibility of another thread using the same consumer? Assuming I've not
configured the scheduler to run parallel jobs.
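
For context, the pattern I have in mind is roughly this (a sketch only;
ssc, topics, and kafkaParams are assumed to be defined elsewhere, but
CanCommitOffsets and HasOffsetRanges are the actual 0.10 integration APIs):

    import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010._
    import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
    import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent

    // ssc, topics, kafkaParams assumed defined elsewhere
    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](topics, kafkaParams))

    stream.foreachRDD { rdd =>
      // Grab offset ranges before any transformation loses the KafkaRDD type
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      // ... write the batch to the external sink here ...

      // This only queues the commit; the driver performs it later, on the
      // same thread as the other consumer calls
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges,
        new OffsetCommitCallback {
          override def onComplete(
              offsets: java.util.Map[TopicPartition, OffsetAndMetadata],
              exception: Exception): Unit = {
            if (exception != null) {
              // commit failed; the batch may be reprocessed after a restart
            }
          }
        })
    }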

On Oct 8, 2016 8:39 PM, "Cody Koeninger" <c...@koeninger.org> wrote:

> The underlying kafka consumer isn't thread safe.  Calling the actual
> commit in compute means it's called in the same thread as the other
> consumer calls.
>
> Using kafka as an offset store only works correctly with
> idempotent datastore writes anyway, so the question of when the commit
> happens shouldn't be an issue.
>
> On Sat, Oct 8, 2016 at 7:25 PM, Srikanth <srikanth...@gmail.com> wrote:
> > Hello,
> >
> > Spark streaming kafka 0.10 integration provides an option to commit
> > offsets to kafka using the commitAsync() API.
> > This only records the offset commit request. The actual commit is
> > performed in compute() after the RDD for the next batch is created.
> > Why is this so? Why not do the commit right when the API is called?
> > The commit process itself is async anyway, with an option to provide a
> > callback handler.
> >
> > This adds a window where the application has requested a commit but it
> > is not yet recorded in kafka's internal topic.
> > Any failure during that window will cause the last batch to be
> > recomputed.
> >
> > My app sinks to an external system where writes can't be idempotent. As
> > such, the operations are assumed to be at-least-once.
> > This seems to be one place where duplicates can be reduced.
> >
> > Srikanth
>
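
For reference, the idempotent write Cody describes would amount to keying
the sink on (topic, partition, offset) so that replayed records become
no-ops. A minimal sketch against a hypothetical JDBC table (the table name,
key columns, and connection string are my assumptions, not part of either
API):

    rdd.foreachPartition { records =>
      // Hypothetical table:
      //   CREATE TABLE events (topic text, kafka_partition int,
      //     kafka_offset bigint, payload text,
      //     PRIMARY KEY (topic, kafka_partition, kafka_offset))
      val conn = java.sql.DriverManager.getConnection("jdbc:postgresql://host/db")
      try {
        val stmt = conn.prepareStatement(
          "INSERT INTO events (topic, kafka_partition, kafka_offset, payload) " +
          "VALUES (?, ?, ?, ?) ON CONFLICT DO NOTHING") // replays are no-ops
        records.foreach { r =>
          stmt.setString(1, r.topic())
          stmt.setInt(2, r.partition())
          stmt.setLong(3, r.offset())
          stmt.setString(4, r.value())
          stmt.addBatch()
        }
        stmt.executeBatch()
      } finally {
        conn.close()
      }
    }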
