Hello,

The Spark Streaming Kafka 0.10 integration provides an option to commit offsets back to Kafka using the commitAsync() API. Calling it only records the commit request; the actual commit is performed in compute(), after the RDD for the next batch is created. Why is this so? Why not perform the commit right when the API is called? The commit process itself is already asynchronous, with an option to provide a callback handler.
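For context, the usage pattern I'm referring to is the one from the 0.10 integration guide, roughly (sketch only, assumes a `stream` obtained from KafkaUtils.createDirectStream):

```scala
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}

stream.foreachRDD { rdd =>
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

  // ... write the batch to the external sink here ...

  // This call only queues the offset ranges to be committed; the
  // actual commit to Kafka happens later, when the KafkaRDD for the
  // next batch is computed.
  stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}
```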
This adds a window where the application has requested a commit but it is not yet recorded in Kafka's internal offsets topic. Any failure during that window will cause the last batch to be recomputed. My app sinks to an external store that can't be made idempotent, so the operations are assumed to be at-least-once. This seems to be one place where duplicates could be reduced.

Srikanth