Hi,

as Kafka Streams focuses on record-at-a-time stream processing,
micro-batching is not something we plan to support directly. Thus,
nothing has changed/improved on that front.

About the store question:

If you buffer up your writes in a store, you need to delete those values
from the store later on to prevent the store from growing unbounded. If
you do this, Kafka Streams will also write corresponding tombstones to
the changelog topic, and thus log compaction will work just fine.
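To illustrate the buffer-and-delete pattern, here is a minimal,
self-contained sketch. A plain TreeMap stands in for a Kafka Streams
KeyValueStore and a simple list stands in for the external sink; the
names (process, flush, externalSink) are made up for illustration. In a
real topology, it is the store's delete(key) call that produces the
tombstone in the changelog.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class BufferAndDelete {
    // Stand-in for a persistent KeyValueStore; in Kafka Streams the
    // store would be changelog-backed.
    static final Map<String, String> store = new TreeMap<>();
    // Stand-in for the external (non-Kafka) sink.
    static final List<String> externalSink = new ArrayList<>();

    // Called for every incoming record: buffer the write in the store.
    static void process(String key, String value) {
        store.put(key, value);
    }

    // Called periodically (e.g. from a punctuation): flush all buffered
    // entries to the external sink, then delete them from the store.
    // In Kafka Streams, each delete writes a tombstone to the changelog
    // topic, so compaction keeps the topic bounded.
    static void flush() {
        List<String> flushedKeys = new ArrayList<>();
        for (Map.Entry<String, String> e : store.entrySet()) {
            externalSink.add(e.getKey() + "=" + e.getValue());
            flushedKeys.add(e.getKey());
        }
        for (String k : flushedKeys) {
            store.remove(k); // tombstone in the real changelog-backed store
        }
    }

    public static void main(String[] args) {
        process("a", "1");
        process("b", "2");
        flush();
        System.out.println("sink=" + externalSink + " storeSize=" + store.size());
    }
}
```

Note that the keys are collected first and deleted after the iteration,
which avoids mutating the store while an iterator over it is open.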

General comment: a strict alignment of offset commits and your batch
writes to an external store is not possible, as there is no
callback/notification when an offset commit happens. However, you don't
need this alignment to build a correct solution: you can just define
your external write interval to be whatever value you want.
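For example, the flush interval can be driven by a simple wall-clock
check, independent of when Kafka Streams commits offsets. The sketch
below is a stand-alone illustration of that idea; the IntervalFlusher
class and its method names are made up and are not part of the Kafka
Streams API. At-least-once semantics hold regardless of the interval,
because a crash between a flush and the next offset commit only causes
records to be re-flushed, never lost.

```java
import java.util.ArrayList;
import java.util.List;

public class IntervalFlusher {
    private final long intervalMs;
    private long lastFlushMs;
    private final List<String> buffer = new ArrayList<>();
    final List<List<String>> flushedBatches = new ArrayList<>();

    IntervalFlusher(long intervalMs, long startMs) {
        this.intervalMs = intervalMs;
        this.lastFlushMs = startMs;
    }

    // Called per record: buffer it, and flush the whole batch to the
    // external system once the configured interval has elapsed. Offset
    // commits happen on their own schedule and are not consulted here.
    void add(String record, long nowMs) {
        buffer.add(record);
        if (nowMs - lastFlushMs >= intervalMs) {
            flushedBatches.add(new ArrayList<>(buffer));
            buffer.clear();
            lastFlushMs = nowMs;
        }
    }

    int pending() { return buffer.size(); }

    public static void main(String[] args) {
        IntervalFlusher f = new IntervalFlusher(1000, 0);
        f.add("r1", 100);   // buffered
        f.add("r2", 500);   // buffered
        f.add("r3", 1200);  // interval elapsed -> flush all three
        System.out.println(f.flushedBatches.size() + " batch, "
                + f.pending() + " pending");
    }
}
```

In a real application the same check would typically live in a
wall-clock punctuation rather than a hand-rolled timer.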

Hope this helps.


-Matthias

On 10/20/17 12:36 PM, Shrijeet Paliwal wrote:
> Kafka version: 0.10.2.1 (can upgrade if needed)
> 
> I wish to revive the discussion around micro batching in Kafka streams.
> Some past discussions are here <http://markmail.org/thread/zdxkvwt6ppq2xhv2>
> & here <http://markmail.org/thread/un7dmn7pyk7eibxz>.
> 
> I am exploring ways to do at-least-once processing of events which are
> handled in small batches as opposed to one at a time. A specific example is
> to buffer mutation ops to a non-kafka sink and align the flushing of
> batched ops with the offset commits.
> 
> The suggestions and workarounds that I have noticed in mailing lists are:
> 
> *a] Don't do it in Kafka Streams, use Kafka Connect.*
> 
> For the sake of this discussion, let's assume using kafka-connect isn't an
> option.
> 
> *b] In Kafka streams, use a key value state store to micro batch and
> perform a flush in punctuate method.*
> 
> The overhead seems nontrivial in this approach: since a persistent
> key-value store is backed by a compacted topic, the keys in the state
> store will not be compaction friendly. For instance, if you use a
> timestamp & some unique-id combination as the key and perform a range
> scan to find ops buffered since the last call to punctuate, the state
> store & backing Kafka topic will grow unbounded. Any retention applied
> to the state store or topic would mean leaking implementation details,
> which makes this approach inelegant.
> 
> My question is: since the last time this use case was mentioned, has a
> better pattern emerged?
> 
> --
> Shrijeet
> 
