Looks like re-partitioning is probably the way to go. I've seen references to this pattern a couple of times but wanted to make sure I wasn't missing something obvious. It also seems Kafka Streams makes this kind of thing a bit easier than Samza.
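For anyone following along, here is a minimal, framework-free sketch of the repartition pattern described in the thread below: re-key two streams on the join key with a deterministic partitioner so the same key always lands in the same partition, then join partition-by-partition. The record shapes and the `hash`-based partitioner are illustrative stand-ins (Kafka's default partitioner hashes the key bytes with murmur2); in Kafka Streams the equivalent would be a `selectKey()` followed by a repartition step.

```python
from collections import defaultdict

def partition_for(key, num_partitions):
    """Deterministic partitioner: the same key always maps to the same
    partition (illustrative; Kafka hashes key bytes with murmur2)."""
    return hash(key) % num_partitions

def repartition(records, join_key_fn, num_partitions):
    """Re-key each record on the join key and bucket it by partition,
    producing the equivalent of T1' / T2' in the thread below."""
    topic = defaultdict(list)
    for record in records:
        new_key = join_key_fn(record)
        topic[partition_for(new_key, num_partitions)].append((new_key, record))
    return topic

def join_copartitioned(t1, t2, num_partitions):
    """Join two co-partitioned topics partition-by-partition. Because both
    sides used the same partitioner and partition count, matching keys are
    guaranteed to be in the same partition."""
    joined = []
    for p in range(num_partitions):
        right = dict(t2.get(p, []))
        for key, left in t1.get(p, []):
            if key in right:
                joined.append((key, left, right[key]))
    return joined
```

The important invariant is the one David states below: both repartitioned topics must share the partitioner *and* the partition count, otherwise matching keys can land in different partitions and the per-partition join silently drops them.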
Thanks for sharing your wisdom, folks :-)

On Wed, Sep 7, 2016 at 1:43 PM, David Garcia <dav...@spiceworks.com> wrote:

> Obviously for the keys you don’t have, you would have to look them
> up… sorry, I kinda missed that part. That is indeed a pain. The job that
> looks those keys up would probably have to batch queries to the external
> system. Maybe you could use kafka-connect-jdbc to stream in updates to
> that system?
>
> -David
>
> On 9/7/16, 3:41 PM, "David Garcia" <dav...@spiceworks.com> wrote:
>
> The “simplest” way to solve this is to “repartition” your data (i.e. the
> streams you wish to join) with the partition key you wish to join on.
> This obviously introduces redundancy, but it will solve your problem.
> For example, suppose you want to join topic T1 and topic T2, but they
> aren’t partitioned on the key you need. You could write two “simple”
> repartition jobs, one for each topic (you can actually do this with one
> job):
>
> T1 -> Job_T1 -> T1’
> T2 -> Job_T2 -> T2’
>
> T1’ and T2’ would be partitioned on your join key and would have the
> same number of partitions, so that you have the guarantees you need to
> do the join (i.e. join T1’ and T2’).
>
> -David
>
> On 9/2/16, 8:43 PM, "Andy Chambers" <achambers.h...@gmail.com> wrote:
>
> Hey Folks,
>
> We are having quite a bit of trouble modelling the flow of data through
> a very kafka-centric system.
>
> As I understand it, every stream you might want to join with another
> must be partitioned the same way. But often streams at the edges of a
> system *cannot* be partitioned the same way, because they don't have
> the partition key yet (often the work for this process is to find the
> key in some lookup table based on some other key we don't control).
>
> We have come up with a few solutions, but everything seems to add
> complexity and backs our designs into a corner.
>
> What is frustrating is that most of the data is not really that big,
> but we have a handful of topics we expect to require a lot of
> throughput.
>
> Is this just unavoidable complexity associated with scale, or am I
> thinking about this in the wrong way? We're going all in on the
> "turning the database inside out" architecture, but we end up spending
> more time thinking about how stuff gets broken up into tasks and
> distributed than we do about our business.
>
> Do these problems seem familiar to anyone else? Did you find any
> patterns that helped keep the complexity down?
>
> Cheers
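To make the key-lookup job David mentions concrete: records arrive at the edge without the join key, so rather than querying the external system once per record, the job batches lookups (e.g. a single JDBC query with an `IN` clause per batch). This is a hedged sketch; `lookup_batch`, `extract_fk`, and the batch size are hypothetical placeholders, not any real connector API.

```python
def enrich_with_keys(records, lookup_batch, extract_fk, batch_size=100):
    """Attach the missing partition key to each record, batching the
    external lookups to cut round trips to the lookup system.

    lookup_batch: hypothetical callable taking a list of foreign keys and
    returning the corresponding join keys, in order.
    extract_fk: pulls the foreign key we *do* have out of a record.
    """
    out = []
    for i in range(0, len(records), batch_size):
        batch = records[i:i + batch_size]
        # One round trip to the external system per batch of records.
        keys = lookup_batch([extract_fk(r) for r in batch])
        for record, key in zip(batch, keys):
            out.append(dict(record, join_key=key))
    return out
```

Once records carry `join_key`, they can feed the repartition step and be joined normally; as suggested above, streaming the lookup table itself into Kafka (e.g. via kafka-connect-jdbc) could replace the remote lookup with a local table join.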