Obviously, for the keys you don’t have, you would have to look them up… sorry, I 
kinda missed that part.  That is indeed a pain.  The job that looks those keys 
up would probably have to batch queries to the external system.  Maybe you 
could use kafka-connect-jdbc to stream in updates from that system?
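
If the external system speaks JDBC, a source-connector config along these 
lines would pull its changes into a Kafka topic you can then enrich from 
(connector, table, and column names here are made up, and the right "mode" 
depends on what your schema supports):

    name=lookup-source
    connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
    connection.url=jdbc:postgresql://lookup-host:5432/lookups
    table.whitelist=key_lookup
    mode=timestamp+incrementing
    timestamp.column.name=updated_at
    incrementing.column.name=id
    topic.prefix=lookup-
    tasks.max=1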

-David


On 9/7/16, 3:41 PM, "David Garcia" <dav...@spiceworks.com> wrote:

    The “simplest” way to solve this is to “repartition” your data (i.e. the 
streams you wish to join) on the key you wish to join on.  This obviously 
introduces redundancy, but it will solve your problem.  For example, suppose 
you want to join topic T1 and topic T2, but they aren’t partitioned on the 
key you need.  You could write a “simple” repartition job for each topic 
(you can actually do this with one job):
    
    T1 -> Job_T1 -> T1’
    T2 -> Job_T2 -> T2’
    
    T1’ and T2’ would be partitioned on your join key and would have the same 
number of partitions, so you have the guarantees you need to do the join 
(i.e. join T1’ and T2’).
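    
    As a rough sketch, Job_T1/Job_T2 could look something like this with the 
Kafka Streams DSL (topic names, value types, and extractJoinKey() are 
placeholders, and the exact API depends on your Streams version):
    
        import java.time.Duration;
        import org.apache.kafka.streams.StreamsBuilder;
        import org.apache.kafka.streams.kstream.JoinWindows;
        import org.apache.kafka.streams.kstream.KStream;
    
        StreamsBuilder builder = new StreamsBuilder();
    
        // Re-key each T1 record on the join key; records with the same new
        // key land in the same partition of T1'.
        builder.<String, String>stream("T1")
               .selectKey((oldKey, value) -> extractJoinKey(value))
               .to("T1-repartitioned");
    
        // Same for T2 (this can live in the same topology, i.e. one job).
        builder.<String, String>stream("T2")
               .selectKey((oldKey, value) -> extractJoinKey(value))
               .to("T2-repartitioned");
    
        // Create both output topics with the same partition count; the
        // streams are then co-partitioned and the join is safe.
        KStream<String, String> t1 = builder.stream("T1-repartitioned");
        KStream<String, String> t2 = builder.stream("T2-repartitioned");
        t1.join(t2, (v1, v2) -> v1 + "|" + v2,
                JoinWindows.of(Duration.ofMinutes(5)));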
    
    -David
    
    
    On 9/2/16, 8:43 PM, "Andy Chambers" <achambers.h...@gmail.com> wrote:
    
        Hey Folks,
        
        We are having quite a bit of trouble modelling the flow of data 
through a very Kafka-centric system.
        
        As I understand it, every stream you might want to join with another 
must be partitioned the same way. But often streams at the edges of a system 
*cannot* be partitioned the same way because they don't have the partition 
key yet (often the job of this process is to find the key in some lookup 
table, based on some other key we don't control).
        
        We have come up with a few solutions, but everything seems to add 
complexity and back our designs into a corner.
        
        What is frustrating is that most of the data is not really that big, 
but we have a handful of topics we expect to require a lot of throughput.
        
        Is this just unavoidable complexity associated with scale, or am I 
thinking about this in the wrong way? We're going all in on the "turning the 
database inside out" architecture, but we end up spending more time thinking 
about how stuff gets broken up into tasks and distributed than about our 
business.
        
        Do these problems seem familiar to anyone else? Did you find any 
patterns that helped keep the complexity down?
        
        Cheers
        
    
    
    

