David,

you wrote:

> Each task would effectively block on DB-io for every history-set retrieval;
> obviously we would use a TTL cache (KTable could be useful here, but it
> wouldn’t be able to hold “all” of the history for every user)

Can you elaborate a bit on why you think a KTable wouldn't be able to hold
the full history for every user?  (Not implying I have a different opinion,
simply trying to understand.)

One advantage of using a KTable is data locality for such lookups during
your stream-table join: when processing a new incoming record, you wouldn't
incur extra per-record processing latency from network round trips to a
remote DB.  I'd expect this to significantly reduce the time needed to
finish processing each incoming message.  There are further advantages to
this setup, see [1].  So if a KTable-based approach is suitable for your
use case, I'd consider giving it a try.
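To make the idea concrete, here is a rough sketch of such a stream-table
join in the 0.10.x Java DSL.  This is only an illustration, not tested
against your setup: the topic names, the `Event`/`History`/`Enriched`
types, `enrich`, and `streamsConfig` are all placeholders, and it assumes
both topics are keyed by user id.  Note also that the KTable's state store
is RocksDB-backed, so it is bounded by local disk rather than by memory:

```java
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KStreamBuilder;
import org.apache.kafka.streams.kstream.KTable;

KStreamBuilder builder = new KStreamBuilder();

// Both topics keyed by user id; the table is materialized in a local,
// RocksDB-backed state store co-partitioned with the event stream.
KStream<String, Event> events = builder.stream("user-events");
KTable<String, History> history = builder.table("user-history");

// The join reads each user's history from the task-local store:
// no network round trip per record.  (In 0.10.0 the KStream-KTable
// join is a leftJoin, so `hist` may be null for unseen users.)
KStream<String, Enriched> enriched =
    events.leftJoin(history, (event, hist) -> enrich(event, hist));
enriched.to("enriched-events");

new KafkaStreams(builder, streamsConfig).start();
```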


[1] http://www.confluent.io/blog/elastic-scaling-in-kafka-streams
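As an aside, the depth-first vs. overlapped distinction you describe below
can be sketched with plain Java futures, Kafka aside.  This is only an
illustration (the `lookupHistory` stub and its latency are made up); the
overlapped variant is roughly what an `mapAsyncUnordered`-style stage over
a Reactive Kafka source buys you:

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class LookupDemo {
    static final ExecutorService POOL = Executors.newFixedThreadPool(4);
    static long sequentialMs, concurrentMs;
    static List<String> sequential, concurrent;

    // Stubbed history lookup: pretend each DB round trip takes ~100 ms.
    static CompletableFuture<String> lookupHistory(int userId) {
        return CompletableFuture.supplyAsync(() -> {
            try { Thread.sleep(100); } catch (InterruptedException e) {
                throw new RuntimeException(e);
            }
            return "history-" + userId;
        }, POOL);
    }

    public static void main(String[] args) {
        // Depth-first (KStreams-style): finish each lookup before starting
        // the next, so total time is the sum of the four round trips.
        long t0 = System.nanoTime();
        sequential = IntStream.rangeClosed(1, 4)
                .mapToObj(id -> lookupHistory(id).join())
                .collect(Collectors.toList());
        sequentialMs = (System.nanoTime() - t0) / 1_000_000;

        // Overlapped (Reactive-Kafka-style): all four lookups in flight at
        // once, so total time is bounded by the slowest single lookup.
        long t1 = System.nanoTime();
        List<CompletableFuture<String>> inFlight = IntStream.rangeClosed(1, 4)
                .mapToObj(LookupDemo::lookupHistory)
                .collect(Collectors.toList());
        concurrent = inFlight.stream()
                .map(CompletableFuture::join)
                .collect(Collectors.toList());
        concurrentMs = (System.nanoTime() - t1) / 1_000_000;

        System.out.println("sequential: " + sequentialMs
                + "ms, concurrent: " + concurrentMs + "ms");
        POOL.shutdown();
    }
}
```

The trade-off is the one you note: overlapping the lookups forfeits strict
per-partition processing order.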



On Sat, Jul 30, 2016 at 5:44 PM, David Garcia <dav...@spiceworks.com> wrote:

> Hey Guozhang, our basic road block is asynchronous processing (this is
> actually related to my previous post about asynchronous processing).  Here
> is the simplest use-case:
>
> The streaming job receives messages.  Each message is a user-event and
> needs to randomly look up that user’s history (for 100’s-of-thousands of
> users, and each user can have 10’s-of-thousands of events in their
> history).  Once the history is retrieved, processing can continue.  We need
> processing to be as fast as possible and we need the ability to easily
> accommodate increases in incoming message traffic.  Here are the two
> designs (with KStreams, and then with Reactive Kafka)
>
> KStreams Approach:
>
> KStreams’ depth-first approach requires finishing processing of one message
> before the next one becomes available.  So, we would have to first estimate
> the average input message rate (and measure the performance of our app) and
> then partition our topic/s appropriately.  Each task would effectively
> block on DB-io for every history-set retrieval; obviously we would use a
> TTL cache (KTable could be useful here, but it wouldn’t be able to hold
> “all” of the history for every user).  If we need to “scale” our
> application, we would add more partitions and application processing
> instances.  Please suggest any other design choice we could go with.  I
> might be missing something.
>
> Reactive Kafka Approach:
>
> Reactive Kafka allows out-of-order processing.  So, while we are fetching
> history for event-1, we can start processing event-2.  In a nutshell,
> Reactive Kafka’s parallelism is not tightly coupled to the number of
> partitions in the topic (obviously this doesn’t apply to the input…we can
> only receive events as fast as the current partition configuration
> allows…but we don’t have to block on IO before we receive the next message)
>
>
> I’m new to both technologies, so any and all suggestions are welcome.
>
> -David
>
> On 7/30/16, 9:24 AM, "Guozhang Wang" <wangg...@gmail.com> wrote:
>
>     Hello David,
>
>     I'd love to learn details about the "flexibility" of Reactive Kafka
>     compared with KStreams in order to see if KStreams can improve on that
> end.
>     Would you like to elaborate a bit more on that end?
>
>     Guozhang
>
>
>     On Thu, Jul 28, 2016 at 12:16 PM, David Garcia <dav...@spiceworks.com>
>     wrote:
>
>     > Our team is evaluating KStreams and Reactive Kafka (version
> 0.11-M4)  on a
>     > confluent 3.0 cluster.  Our testing is very simple (pulling from one
> topic,
>     > doing a simple transform) and then writing out to another topic.
>     >
>     > The performance for the two systems is night and day. Both
> applications
>     > were running on a laptop and connecting to kafka over a wifi
> network.  Here
>     > are the numbers:
>     >
>     > KStreams: ~14K messages per second
>     > Reactive Kafka: ~110 messages per second
>     >
>     > Both the input, and output topic had 54 partitions.  I’m fairly
> certain
>     > I’m not using Reactive Kafka with good configuration.  Here is some
> stubbed
>     > out code:
> https://gist.github.com/anduill/2e17cd7a40d4a86fefe19870d1270f5b
>     >
>     > One note, I am using the confluent stack (hence the
>     > CachedSchemaRegistryClient)
>     >
>     > I like the flexibility of Reactive Kafka, so we’d really like to use
>     > it…but if performance is going to be like this, I can’t really
> justify it.
>     > I’m a scala/akka/streaming-akka newbie, so I’m sure there are better
> ways
>     > to use the API.  Any help is appreciated.
>     >
>     > -David
>     >
>
>
>
>     --
>     -- Guozhang
>
>
>
