The underlying Kafka consumer isn't thread safe. Doing the actual
commit in compute() means it happens on the same thread as the other
consumer calls.
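
In other words, commitAsync() only enqueues the offset ranges on a
thread-safe queue; the next compute(), which runs on the thread that
owns the consumer, drains the queue and issues the real commit. A
rough sketch of that pattern in Scala (simplified, not the actual
Spark source; the class and member names here are illustrative):

    import java.util.concurrent.ConcurrentLinkedQueue
    import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
    import org.apache.kafka.common.TopicPartition

    class DirectStreamSketch(consumer: KafkaConsumer[String, String]) {
      // The user-facing commitAsync only touches this thread-safe
      // queue, never the consumer itself.
      private val queue = new ConcurrentLinkedQueue[(TopicPartition, Long)]()

      def commitAsync(ranges: Seq[(TopicPartition, Long)]): Unit =
        ranges.foreach(r => queue.add(r))

      // compute() runs on the single thread that owns the consumer,
      // so draining the queue and committing here is safe.
      def compute(): Unit = {
        val toCommit = new java.util.HashMap[TopicPartition, OffsetAndMetadata]()
        var next = queue.poll()
        while (next != null) {
          toCommit.put(next._1, new OffsetAndMetadata(next._2))
          next = queue.poll()
        }
        if (!toCommit.isEmpty) consumer.commitAsync(toCommit, null)
        // ... then go on to create the RDD for the next batch ...
      }
    }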

Using Kafka as an offset store only works correctly with
idempotent datastore writes anyway, so exactly when the commit
happens shouldn't be an issue.
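
For reference, this is roughly the pattern from the integration docs:
get the offset ranges from the RDD, do the (idempotent) output, and
only then ask for the commit. Here stream is the
InputDStream[ConsumerRecord[String, String]] returned by
KafkaUtils.createDirectStream, and upsert stands in for whatever your
idempotent write is:

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      rdd.foreachPartition { partition =>
        // Idempotent write, e.g. an upsert keyed on a natural id, so
        // replaying this batch after a failure is harmless.
        partition.foreach(record => upsert(record.key, record.value))
      }
      // Request the commit only after the output has succeeded; it
      // reaches Kafka on a later compute(), which is fine because
      // the write above is idempotent.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }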

On Sat, Oct 8, 2016 at 7:25 PM, Srikanth <srikanth...@gmail.com> wrote:
> Hello,
>
> The Spark Streaming Kafka 0.10 integration provides an option to commit
> offsets to Kafka using the commitAsync() API.
> This only records the offset commit request. The actual commit is performed
> in compute() after the RDD for the next batch is created.
> Why is this so? Why not do the commit right when the API is called?
> The commit process itself is async anyway, with an option to provide a
> callback handler.
>
> This adds a window where the application has requested a commit but it is
> not yet recorded in Kafka's internal offsets topic.
> Any failure during that window will cause the last batch to be recomputed.
>
> My app sinks to an external store that can't be idempotent. As such, the
> operations are assumed to be at-least-once.
> This seems to be one place where duplicates can be reduced.
>
> Srikanth
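
For a sink that genuinely can't be made idempotent, as in the quoted
case, the usual way to close that window is to not use Kafka as the
offset store at all: keep the offsets in the same transactional store
as the results, so a batch and its offsets commit or roll back
together. A rough sketch, assuming a JDBC java.sql.Connection conn
and hypothetical results/offsets tables (none of this is Spark API
beyond HasOffsetRanges):

    import org.apache.spark.streaming.kafka010.HasOffsetRanges

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      val rows = rdd.map(r => (r.key, r.value)).collect() // small batches only
      conn.setAutoCommit(false)
      try {
        val ins = conn.prepareStatement("INSERT INTO results (k, v) VALUES (?, ?)")
        rows.foreach { case (k, v) =>
          ins.setString(1, k); ins.setString(2, v); ins.addBatch()
        }
        ins.executeBatch()
        val upd = conn.prepareStatement(
          "UPDATE offsets SET until_offset = ? WHERE topic = ? AND part = ?")
        offsetRanges.foreach { o =>
          upd.setLong(1, o.untilOffset)
          upd.setString(2, o.topic)
          upd.setInt(3, o.partition)
          upd.addBatch()
        }
        upd.executeBatch()
        // Results and offsets become visible atomically; a replayed
        // batch finds already-advanced offsets and can be skipped.
        conn.commit()
      } catch { case e: Exception => conn.rollback(); throw e }
    }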
