Obviously for the keys you don’t have, you would have to look them up…sorry, I kinda missed that part. That is indeed a pain. The job that looks those keys up would probably have to batch queries to the external system. Maybe you could use kafka-connect-jdbc to stream in updates to that system?
-David On 9/7/16, 3:41 PM, "David Garcia" <dav...@spiceworks.com> wrote: The “simplest” way to solve this is to “repartition” your data (i.e. the streams you wish to join) with the partition key you wish to join on. This obviously introduces redundancy, but it will solve your problem. For example.. suppose you want to join topic T1 and topic T2…but they aren’t partitioned on the key you need. You could write two “simple” repartition jobs for each topic (you can actually do this with one job): T1 -> Job_T1 -> T1’ T2 -> Job_T2 -> T2’ T1’ and T2’ would be partitioned on your join key and would have the same number of partitions so that you have the guarantees you need to do the join. (i.e. join T1’ and T2’). -David On 9/2/16, 8:43 PM, "Andy Chambers" <achambers.h...@gmail.com> wrote: Hey Folks, We are having quite a bit trouble modelling the flow of data through a very kafka centric system As I understand it, every stream you might want to join with another must be partitioned the same way. But often streams at the edges of a system *cannot* be partitioned the same way because they don't have the partition key yet (often the work for this process is to find the key in some lookup table based on some other key we don't control). We have come up with a few solutions but everything seems to add complexity and backs our designs into a corner. What is frustrating is that most of the data is not really that big but we have a handful of topics we expect to require a lot of throughput. Is this just unavoidable complexity asociated with scale or am I thinking about this in the wrong way. We're going all in on the "turning the database inside out" architecture but we end up spending more time thinking about how stuff gets broken up into tasks and distributed than we are about our business. Do these problems seem familiar to anyone else? Did you find any patterns that helped keep the complexity down. Cheers