Hey Guozhang, our basic roadblock is asynchronous processing (this is actually 
related to my previous post about asynchronous processing).  Here is the 
simplest use-case:

The streaming job receives messages.  Each message is a user event and needs a 
random-access lookup of that user’s history (hundreds of thousands of users, 
each of whom can have tens of thousands of events in their history).  Once the 
history is retrieved, processing can continue.  We need processing to be as 
fast as possible, and we need the ability to easily accommodate increases in 
incoming message traffic.  Here are the two designs (first with KStreams, then 
with Reactive Kafka):

KStreams Approach:

KStreams’ depth-first approach requires finishing the processing of one message 
before the next one becomes available.  So we would first have to estimate the 
average input message rate (and measure the performance of our app) and then 
partition our topic(s) appropriately.  Each task would effectively block on 
DB I/O for every history-set retrieval; obviously we would use a TTL cache 
(a KTable could be useful here, but it wouldn’t be able to hold “all” of the 
history for every user).  If we need to “scale” our application, we would add 
more partitions and application processing instances.  Please suggest any other 
design choice we could go with; I might be missing something.

Reactive Kafka Approach:

Reactive Kafka allows out-of-order processing.  So, while we are fetching 
history for event-1, we can start processing event-2.  In a nutshell, Reactive 
Kafka’s parallelism is not tightly coupled to the number of partitions in the 
topic (obviously this doesn’t apply to the input…we can only receive events as 
fast as the current partition configuration allows…but we don’t have to block 
on I/O before we receive the next message).
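
Here is the rough shape of the Reactive Kafka version, using mapAsyncUnordered 
so the history lookups can overlap.  Again just a sketch (and I may not have 
the 0.11-M4 settings/subscription API exactly right): there are no offset 
commits, the parallelism of 32 is arbitrary, and fetchHistory stands in for a 
real async DB call.

import akka.actor.ActorSystem
import akka.kafka.scaladsl.{Consumer, Producer}
import akka.kafka.{ConsumerSettings, ProducerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}

import scala.concurrent.Future

object ReactiveHistoryEnricher extends App {

  implicit val system = ActorSystem("history-enricher")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // Placeholder async DAO -- returns a Future instead of blocking the stream.
  def fetchHistory(userId: String): Future[Seq[String]] = Future.successful(Seq.empty)

  val consumerSettings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092")
    .withGroupId("history-enricher")
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  val producerSettings = ProducerSettings(system, new StringSerializer, new StringSerializer)
    .withBootstrapServers("localhost:9092")

  Consumer.plainSource(consumerSettings, Subscriptions.topics("user-events"))
    // Up to 32 history lookups in flight at once; completion order doesn't
    // have to match arrival order, so one slow user doesn't stall the rest.
    .mapAsyncUnordered(parallelism = 32) { record =>
      val userId = record.key()
      fetchHistory(userId).map(history => s"${record.value()}|${history.size} prior events")
    }
    .map(enriched => new ProducerRecord[String, String]("enriched-events", enriched))
    .runWith(Producer.plainSink(producerSettings))
}

The point is that mapAsyncUnordered keeps many lookups in flight per consumer, 
so DB latency stops being the per-record throughput ceiling; the input rate is 
still bounded by the partition count, as noted above.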


I’m new to both technologies, so any and all suggestions are welcome.

-David

On 7/30/16, 9:24 AM, "Guozhang Wang" <wangg...@gmail.com> wrote:

    Hello David,
    
    I'd love to learn details about the "flexibility" of Reactive Kafka
    compared with KStreams in order to see if KStreams can improve on that end.
    Would you like to elaborate a bit more on that end?
    
    Guozhang
    
    
    On Thu, Jul 28, 2016 at 12:16 PM, David Garcia <dav...@spiceworks.com>
    wrote:
    
    > Our team is evaluating KStreams and Reactive Kafka (version 0.11-M4)  on a
    > confluent 3.0 cluster.  Our testing is very simple (pulling from one topic,
    > doing a simple transform) and then writing out to another topic.
    >
    > The performance for the two systems is night and day. Both applications
    > were running on a laptop and connecting to kafka over a wifi network.  Here
    > are the numbers:
    >
    > KStreams: ~14K messages per second
    > Reactive Kafka: ~110 messages per second
    >
    > Both the input, and output topic had 54 partitions.  I’m fairly certain
    > I’m not using Reactive kafka with good configuration.  Here is some stubbed
    > out code: https://gist.github.com/anduill/2e17cd7a40d4a86fefe19870d1270f5b
    >
    > One note, I am using the confluent stack (hence the
    > CachedSchemaRegistryClient)
    >
    > I like the flexibility of Reactive Kafka, so we’d really like to use
    > it…but if performance is going to be like this, I can’t really justify it.
    > I’m a scala/akka/streaming-akka newbie, so I’m sure there are better ways
    > to use the API.  Any help is appreciated.
    >
    > -David
    >
    
    
    
    -- 
    -- Guozhang
    
