If I call commit in foreachRDD at the end of a batch, is there still a possibility of another thread using the same consumer? Assuming I haven't configured the scheduler to run parallel jobs.
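For concreteness, the pattern I'm asking about is roughly the one from the 0.10 integration guide (a sketch only; `stream` is the direct stream from KafkaUtils.createDirectStream, and process() is a placeholder for my non-idempotent sink):

    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      // Capture offset ranges before any transformation loses the KafkaRDD type.
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

      rdd.foreachPartition { records =>
        records.foreach(process) // placeholder sink, at-least-once semantics
      }

      // Queues the commit request at the end of the batch; per the thread below,
      // the actual commit happens later in compute(), on the consumer's thread.
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
    }
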
On Oct 8, 2016 8:39 PM, "Cody Koeninger" <c...@koeninger.org> wrote:
> The underlying kafka consumer isn't thread safe. Calling the actual
> commit in compute means it's called in the same thread as the other
> consumer calls.
>
> Using kafka as an offset store only works correctly with
> idempotent datastore writes anyway, so the question of when the commit
> happens shouldn't be an issue.
>
> On Sat, Oct 8, 2016 at 7:25 PM, Srikanth <srikanth...@gmail.com> wrote:
> > Hello,
> >
> > Spark streaming kafka 0.10 integration provides an option to commit
> > offsets to kafka using the commitAsync() API.
> > This only records the offset commit request. The actual commit is
> > performed in compute() after the RDD for the next batch is created.
> > Why is this so? Why not do the commit right when the API is called?
> > The commit process itself is async anyway, with an option to provide
> > a callback handler.
> >
> > This adds a window where the application has requested a commit but it
> > is not yet recorded in kafka's internal topic.
> > Any failure during that window will cause the last batch to be
> > recomputed.
> >
> > My app sinks to an external store whose writes can't be made
> > idempotent. As such the operations are assumed to be at-least-once.
> > This seems to be one place where duplicates can be reduced.
> >
> > Srikanth
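P.S. In case it helps clarify when the commit actually lands, a minimal sketch of the callback variant mentioned above (the two-argument commitAsync overload; the println logging is a placeholder):

    import java.util.{Map => JMap}
    import org.apache.kafka.clients.consumer.{OffsetAndMetadata, OffsetCommitCallback}
    import org.apache.kafka.common.TopicPartition
    import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

    stream.foreachRDD { rdd =>
      val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges,
        new OffsetCommitCallback {
          // Invoked when the commit actually reaches kafka (or fails), which
          // may be a batch later than the commitAsync call itself.
          def onComplete(offsets: JMap[TopicPartition, OffsetAndMetadata],
                         exception: Exception): Unit = {
            if (exception != null) println(s"Offset commit failed: $exception")
            else println(s"Committed: $offsets")
          }
        })
    }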