Hi Michael,

These are good questions, and I can confirm that the system works the way you hope it does, provided you use the DSL and don't make up keys arbitrarily. In other words, nothing currently prevents you from shooting yourself in the foot, e.g. by inventing keys and writing the same made-up key from different processing nodes.

However, if you use the Kafka Streams primitives, such bad situations are not supposed to happen.
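To make that concrete, here is a rough sketch of the safe pattern, written against the Processor API roughly as it looked in the 0.10.x releases; the store name, types and counting logic are made up for illustration. The point is simply that every store access goes through the key that process() was given:

import org.apache.kafka.streams.processor.Processor;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.state.KeyValueStore;

// Illustrative only: counts occurrences per key in a persistent store.
public class CountingProcessor implements Processor<String, String> {

    private KeyValueStore<String, Long> store;

    @Override
    @SuppressWarnings("unchecked")
    public void init(ProcessorContext context) {
        // "counts" must match the name the store was registered with
        // when the topology was built.
        store = (KeyValueStore<String, Long>) context.getStateStore("counts");
    }

    @Override
    public void process(String key, String value) {
        // Only touch the entry for the key the framework routed to this task.
        // Writing some other, made-up key is the foot-shooting case above:
        // a different task that owns that key's partition may be doing the same.
        Long current = store.get(key);
        store.put(key, current == null ? 1L : current + 1);
    }

    @Override
    public void punctuate(long timestamp) {
        // No scheduled work in this sketch.
    }

    @Override
    public void close() {}
}

With that discipline, the store's contents stay keyed the same way as the input partitions, which is what the colocation you are asking about relies on.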
Thanks
Eno

> On 18 Jul 2016, at 11:28, Michael Ruberry <michae...@taboola.com> wrote:
>
> Hi all,
>
> My company, Taboola, has been looking at Kafka Streams, and we are still
> confused about some details of partitions and store persistence.
>
> So let's say we have an app consuming a single topic and we're using an
> embedded and persisted key-value store:
>
> 1. If we restart the app, will each node finish loading values from the
>    store before it begins processing? (Analogous to Samza's bootstrap
>    streams?)
> 2. How are the store's entries partitioned among the nodes? Is there a
>    promise that if an entry in the store and the stream have the same key,
>    they will be colocated on the same node?
> 3. Another way of putting the above question: assuming we are using the
>    Processor API, if in process(key, value) we only update the store on the
>    current key, are we guaranteed that across restarts the prior data
>    written for each key will be available to the node processing that key?
> 4. Do the answers to 2 and 3 change if the number of nodes varies across
>    restarts?
> 5. Do the answers to 2, 3 and 4 change if there are code changes across
>    restarts?
> 6. If we write an arbitrary key in process(), what will happen? What if
>    two nodes write the same key in the same store? In general, should we
>    only be accessing keys that correspond to the key given in process()?
>
> We hope the system works like this:
>
> 1. Data from the data store acts like a bootstrap stream in Samza, and on
>    restart is distributed to nodes according to the data's keys BEFORE any
>    other processing is performed.
> 2. If two items, whether from a topic or a data store, have the same key,
>    then they are processed on the same node.
>
> These are the critical features we're looking for, as we think they will
> allow us to (1) adjust the number of nodes and make code changes without
> worrying about data loss, (2) guarantee that data and processing are
> properly colocated, and (3) easily coordinate multiple topics and data
> stores.
>
> Looking forward to your responses!
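For reference, a minimal sketch of the DSL route described above, where the store is created, keyed and partitioned by the framework, so there is no opportunity to write arbitrary keys into it. Topic and store names are invented for the example, and the groupByKey/count/Materialized calls shown come from a newer Kafka Streams release than the one current at the time of this thread (the 0.10.0 method names differ slightly):

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.state.KeyValueStore;

public class DslCountExample {

    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Records are already keyed on the input topic; groupByKey keeps
        // that keying, so the count store is partitioned the same way.
        KStream<String, String> events = builder.stream("events");

        KTable<String, Long> counts = events
            .groupByKey()
            .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("event-counts"));

        // The framework manages the store and its changelog topic; on
        // restart each task restores its share of "event-counts" before
        // processing new records.
        counts.toStream().to("event-counts-output",
            Produced.with(Serdes.String(), Serdes.Long()));
    }
}

The changelog topic behind the store is what gives the restore-before-processing behaviour asked about above.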