Looks like re-partitioning is probably the way to go. I've seen references
to this pattern a couple of times but wanted to make sure I wasn't missing
something obvious. It looks like Kafka Streams makes this kind of thing a
bit easier than Samza.
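For anyone finding this thread later: the repartition pattern David describes below (T1 -> Job_T1 -> T1', T2 -> Job_T2 -> T2') works because T1' and T2' share a partitioner and a partition count, so matching join keys always land in the same partition index. Here's a minimal Python sketch of that guarantee — the crc32 partitioner, topic contents, and record values are stand-ins for illustration (Kafka's default partitioner actually uses murmur2), not real Kafka API calls:

```python
import zlib

NUM_PARTITIONS = 4


def partition_for(key: bytes, num_partitions: int) -> int:
    # Stand-in for Kafka's default partitioner (which uses murmur2).
    # The property that matters: the same key always maps to the same index.
    return zlib.crc32(key) % num_partitions


def repartition(records, num_partitions):
    """Plays the role of Job_T1 / Job_T2: bucket each record by the join
    key using the shared partitioner, producing T1' / T2'."""
    out = [[] for _ in range(num_partitions)]
    for join_key, value in records:
        out[partition_for(join_key, num_partitions)].append((join_key, value))
    return out


# Assume the join key has already been looked up and attached to each record
# (the painful step discussed in this thread).
t1 = [(b"user-1", "click"), (b"user-2", "click")]
t2 = [(b"user-1", "purchase"), (b"user-2", "purchase")]

t1_prime = repartition(t1, NUM_PARTITIONS)
t2_prime = repartition(t2, NUM_PARTITIONS)

# Co-partitioning guarantee: within any partition p, T1' and T2' hold the
# same set of join keys, so a per-partition task can join them locally.
for p in range(NUM_PARTITIONS):
    assert {k for k, _ in t1_prime[p]} == {k for k, _ in t2_prime[p]}
```

In Kafka Streams the same effect comes from re-keying a stream before a join (the DSL inserts the repartition topic for you), which is a big part of why it feels easier than hand-rolling the two Samza jobs.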

Thanks for sharing your wisdom folks :-)



On Wed, Sep 7, 2016 at 1:43 PM, David Garcia <dav...@spiceworks.com> wrote:

> Obviously for the keys you don’t have, you would have to look them
> up…sorry, I kinda missed that part.  That is indeed a pain.  The job that
> looks those keys up would probably have to batch queries to the external
> system.  Maybe you could use kafka-connect-jdbc to stream in updates to
> that system?
>
> -David
>
>
> On 9/7/16, 3:41 PM, "David Garcia" <dav...@spiceworks.com> wrote:
>
>     The “simplest” way to solve this is to “repartition” your data (i.e.
> the streams you wish to join) with the partition key you wish to join on.
> This obviously introduces redundancy, but it will solve your problem.  For
> example, suppose you want to join topic T1 and topic T2, but they aren’t
> partitioned on the key you need.  You could write a “simple” repartition
> job for each topic (you can actually do this with one job):
>
>     T1 -> Job_T1 -> T1’
>     T2 -> Job_T2 -> T2’
>
>     T1’ and T2’ would be partitioned on your join key and would have the
> same number of partitions so that you have the guarantees you need to do
> the join.  (i.e. join T1’ and T2’).
>
>     -David
>
>
>     On 9/2/16, 8:43 PM, "Andy Chambers" <achambers.h...@gmail.com> wrote:
>
>         Hey Folks,
>
>         We are having quite a bit of trouble modelling the flow of data
>         through a very Kafka-centric system.
>
>         As I understand it, every stream you might want to join with
>         another must be partitioned the same way. But often streams at the
>         edges of a system *cannot* be partitioned the same way, because
>         they don't have the partition key yet (often the work of this
>         process is to find the key in some lookup table based on some
>         other key we don't control).
>
>         We have come up with a few solutions, but everything seems to add
>         complexity and backs our designs into a corner.
>
>         What is frustrating is that most of the data is not really that
>         big, but we have a handful of topics we expect to require a lot
>         of throughput.
>
>         Is this just unavoidable complexity associated with scale, or am
>         I thinking about this in the wrong way? We're going all in on the
>         "turning the database inside out" architecture, but we end up
>         spending more time thinking about how stuff gets broken up into
>         tasks and distributed than we do thinking about our business.
>
>         Do these problems seem familiar to anyone else? Did you find any
>         patterns that helped keep the complexity down?
>
>         Cheers
>
