David, you wrote:

> Each task would effectively block on DB-io for every history-set
> retrieval; obviously we would use a TTL cache (KTable could be useful
> here, but it wouldn't be able to hold "all" of the history for every user)

Can you elaborate a bit on why you think a KTable wouldn't be able to hold
the full history for every user? (Not implying I have a different opinion,
simply trying to understand.)

One advantage of using a KTable is data locality for any such lookups
during your stream-table join: when processing a new incoming record, you
wouldn't incur increased per-record processing latency from network round
trips to a remote DB. I'd expect this to significantly reduce the time
needed to finish processing each new incoming message. There are further
advantages to this setup, see [1]. So if a KTable-based approach is
suitable for your use case, I'd consider giving it a try.

[1] http://www.confluent.io/blog/elastic-scaling-in-kafka-streams
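To make that concrete, here is a minimal sketch of such a stream-table join
(in Scala, against the 0.10.0 Streams API that ships with Confluent 3.0).
This is only a sketch: topic names are made up, both topics are assumed to
be keyed by user id so the join is co-partitioned, and I'm using plain
String serdes rather than your Avro types to keep it short:

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.kstream.{KStream, KStreamBuilder, KTable, ValueJoiner}
import org.apache.kafka.streams.{KafkaStreams, StreamsConfig}

object UserEventEnricher extends App {
  val props = new Properties()
  props.put(StreamsConfig.APPLICATION_ID_CONFIG, "user-event-enricher")
  props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

  val builder = new KStreamBuilder()

  // Incoming user events, keyed by user id.
  val events: KStream[String, String] =
    builder.stream(Serdes.String(), Serdes.String(), "user-events")

  // Per-user history changelog topic, materialized as a local KTable.
  val history: KTable[String, String] =
    builder.table(Serdes.String(), Serdes.String(), "user-history")

  // The history lookup is a local state-store read, not a remote DB round trip.
  val enriched: KStream[String, String] = events.leftJoin(history,
    new ValueJoiner[String, String, String] {
      override def apply(event: String, hist: String): String =
        s"$event|history=${Option(hist).getOrElse("")}"
    })

  enriched.to(Serdes.String(), Serdes.String(), "enriched-user-events")

  new KafkaStreams(builder, new StreamsConfig(props)).start()
}

Scaling out is then a matter of starting more instances of the same
application; the table's partitions move together with the stream tasks.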
On Sat, Jul 30, 2016 at 5:44 PM, David Garcia <dav...@spiceworks.com> wrote:

> Hey Guozhang, our basic road block is asynchronous processing (this is
> actually related to my previous post about asynchronous processing). Here
> is the simplest use-case:
>
> The streaming job receives messages. Each message is a user-event and
> needs to randomly look up that user's history (for 100's-of-thousands of
> users, and each user can have 10's-of-thousands of events in their
> history). Once the history is retrieved, processing can continue. We need
> processing to be as fast as possible, and we need the ability to easily
> accommodate increases in incoming message traffic. Here are the two
> designs (with KStreams, and then with Reactive Kafka):
>
> KStreams Approach:
>
> KStreams' depth-first approach requires finishing processing of one
> message before the next one becomes available. So, we would have to first
> estimate the average input message rate (and measure the performance of
> our app) and then partition our topic/s appropriately. Each task would
> effectively block on DB-io for every history-set retrieval; obviously we
> would use a TTL cache (KTable could be useful here, but it wouldn't be
> able to hold "all" of the history for every user). If we need to "scale"
> our application, we would add more partitions and application processing
> instances. Please suggest any other design choice we could go with; I
> might be missing something.
>
> Reactive Kafka Approach:
>
> Reactive Kafka allows out-of-order processing. So, while we are fetching
> history for event-1, we can start processing event-2. In a nutshell,
> Reactive Kafka parallelism is not tightly coupled to the number of
> partitions in the topic (obviously this doesn't apply to the input: we
> can only receive events as fast as the current partition configuration
> allows, but we don't have to block on io before we receive the next
> message).
>
> I'm new to both technologies, so any and all suggestions are welcome.
>
> -David
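For comparison, the out-of-order lookup described above maps to Akka
Streams' mapAsyncUnordered. A rough sketch only: fetchHistory stands in for
whatever asynchronous DB client is used, topic names and the parallelism of
32 are made up, and the exact ConsumerSettings/Subscriptions API differs a
bit across the 0.11 milestone releases:

import akka.actor.ActorSystem
import akka.kafka.scaladsl.{Consumer, Producer}
import akka.kafka.{ConsumerSettings, ProducerSettings, Subscriptions}
import akka.stream.ActorMaterializer
import org.apache.kafka.clients.producer.ProducerRecord
import org.apache.kafka.common.serialization.{StringDeserializer, StringSerializer}
import scala.concurrent.Future

object AsyncHistoryEnricher extends App {
  implicit val system = ActorSystem("async-history-enricher")
  implicit val materializer = ActorMaterializer()
  import system.dispatcher

  // Placeholder for the real asynchronous DB client call.
  def fetchHistory(userId: String): Future[String] =
    Future.successful(s"history-for-$userId")

  val consumerSettings =
    ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
      .withBootstrapServers("localhost:9092")
      .withGroupId("async-history-enricher")

  val producerSettings =
    ProducerSettings(system, new StringSerializer, new StringSerializer)
      .withBootstrapServers("localhost:9092")

  // Up to 32 history lookups in flight at once; later records are not
  // blocked behind a slow lookup for an earlier one.
  Consumer.plainSource(consumerSettings, Subscriptions.topics("user-events"))
    .mapAsyncUnordered(32) { record =>
      fetchHistory(record.key()).map { history =>
        new ProducerRecord[String, String](
          "enriched-user-events", record.key(), s"${record.value()}|$history")
      }
    }
    .runWith(Producer.plainSink(producerSettings))
}

The trade-off is that once lookups complete out of order, per-partition
ordering of the produced records is no longer preserved, and committing
offsets safely needs more care.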
> On 7/30/16, 9:24 AM, "Guozhang Wang" <wangg...@gmail.com> wrote:
>
> Hello David,
>
> I'd love to learn details about the "flexibility" of Reactive Kafka
> compared with KStreams, in order to see if KStreams can improve on that
> end. Would you like to elaborate a bit more?
>
> Guozhang
>
> On Thu, Jul 28, 2016 at 12:16 PM, David Garcia <dav...@spiceworks.com>
> wrote:
>
> > Our team is evaluating KStreams and Reactive Kafka (version 0.11-M4) on
> > a Confluent 3.0 cluster. Our testing is very simple: pulling from one
> > topic, doing a simple transform, and then writing out to another topic.
> >
> > The performance of the two systems is night and day. Both applications
> > were running on a laptop and connecting to Kafka over a wifi network.
> > Here are the numbers:
> >
> > KStreams: ~14K messages per second
> > Reactive Kafka: ~110 messages per second
> >
> > Both the input and output topics had 54 partitions. I'm fairly certain
> > I'm not using Reactive Kafka with a good configuration. Here is some
> > stubbed-out code:
> > https://gist.github.com/anduill/2e17cd7a40d4a86fefe19870d1270f5b
> >
> > One note: I am using the Confluent stack (hence the
> > CachedSchemaRegistryClient).
> >
> > I like the flexibility of Reactive Kafka, so we'd really like to use
> > it...but if performance is going to be like this, I can't really
> > justify it. I'm a scala/akka/streaming-akka newbie, so I'm sure there
> > are better ways to use the API. Any help is appreciated.
> >
> > -David
>
> --
> -- Guozhang