Hey Guozhang, our basic road block is asynchronous processing (this is actually related to my previous post about asynchronous processing). Here is the simplest use-case:
The streaming job receives messages. Each message is a user-event and needs to randomly look up that user’s history (for 100’s-of-thousands of users, and each user can have 10’s-of-thousands of events in their history). Once the history is retrieved, processing can continue. We need processing to be as fast as possible and we need the ability to easily accommodate increases in incoming message traffic. Here are the two designs (with KStreams, and then with Reactive Kafka) KStreams Approach: KStreams depth-first approach requires finishing processing of one message before the next one becomes available. So, we would have to first estimate the average input message rate (and measure the performance of our app) and then partition our topic/s appropriately. Each task would effectively block on DB-io for every history-set retrieval; obviously we would use a TTL cache (KTable could be useful here, but it wouldn’t be able to hold “all” of the history for every user). If we need to “scale” our application, we would add more partitions and application processing instances. Please suggest any other design choice we could go with. I’m might be missing something. Reactive Kafka Approach: Reactive Kafka allows out-of-order processing. So, while we are fetching history for event-1, we can start processing event-2. In a nutshell Reactive-Kafka parallelism is not tightly-coupled to the number of partitions in the topic (obviously this doesn’t apply to the input…we can only receive events as fast as current partition configuration allows…but we don’t’ have to block on io before we receive the next message) I’m new to both technologies, so any and all suggestions are welcome. -David On 7/30/16, 9:24 AM, "Guozhang Wang" <wangg...@gmail.com> wrote: Hello David, I'd love to learn details about the "flexibility" of Reactive Kafka compared with KStreams in order to see if KStreams can improve on that end. Would you like to elaborate a bit more on that end? Guozhang On Thu, Jul 28, 2016 at 12:16 PM, David Garcia <dav...@spiceworks.com> wrote: > Our team is evaluating KStreams and Reactive Kafka (version 0.11-M4) on a > confluent 3.0 cluster. Our testing is very simple (pulling from one topic, > doing a simple transform) and then writing out to another topic. > > The performance for the two systems is night and day. Both applications > were running on a laptop and connecting to kafka over a wifi network. Here > are the numbers: > > KStreams: ~14K messages per second > Reactive Kafka: ~110 messages per second > > Both the input, and output topic had 54 partitions. I’m fairly certain > I’m not using Reactive kafka with good configuration. Here is some stubbed > out code: https://gist.github.com/anduill/2e17cd7a40d4a86fefe19870d1270f5b > > One note, I am using the confluent stack (hence the > CachedSchemaRegistryClient) > > I like the flexibility of Reactive Kafka, so we’d really like to use > it…but if performance is going to be like this, I can’t really justify it. > I’m a scala/akka/streaming-akka newbie, so I’m sure there are better ways > to use the API. Any help is appreciated. > > -David > -- -- Guozhang