We actually don't have a Kafka cluster set up yet at all.  Right now we
just have our 8 application servers.  We currently sample some
impressions and then dedupe/count them at a different DC, but we're
looking to analyze all impressions for some overall analytics.

Our requests are around 100-200 bytes each.  If we lost some of them due
to network jitter etc., it would be fine; we're just trying to get a
rough overall count for each attribute.  Creating batched messages
definitely makes sense and will also cut down on the network IO.
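
For the producer side I'm thinking something like the following
throughput-oriented config.  The batch size, linger time, and the lz4
choice are just guesses at this point; the point is that the producer
already packs many small records into one request per broker, and
leader-only acks are fine since we can tolerate some loss:

import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.common.serialization.ByteArraySerializer;

public class ImpressionProducerFactory {
    // Throughput-oriented settings; losing a few impressions is acceptable,
    // so favor batching and cheap acks over durability.
    public static KafkaProducer<byte[], byte[]> create(String bootstrapServers) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, ByteArraySerializer.class);
        props.put(ProducerConfig.BATCH_SIZE_CONFIG, 256 * 1024);   // pack many ~100-200 byte records per request
        props.put(ProducerConfig.LINGER_MS_CONFIG, 50);            // wait up to 50 ms to fill a batch
        props.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "lz4");  // compress whole batches on the wire
        props.put(ProducerConfig.ACKS_CONFIG, "1");                // leader-only ack; some loss is tolerable
        return new KafkaProducer<>(props);
    }
}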

We're trying to determine the required setup for Kafka to do what we're
looking to do.  These are physical servers, so we'll most likely need to
buy new hardware.  For the first run I think we'll try it out on one of
our application clusters that gets a smaller amount of traffic (300-400k
req/sec) and run the Kafka cluster on the same machines as the
applications.
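
Rough back-of-the-envelope, assuming ~200 bytes per request: 400k req/sec
* 200 bytes is about 80 MB/sec of raw payload, or roughly 7 TB/day before
compression and replication, so disk sizing will mostly come down to
retention and compression.  Co-locating the brokers with the applications
also means they'll compete for page cache, disk, and the NIC.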

So would the best route here be something like: each application server
batches requests and sends them to Kafka, a streams consumer tallies up
the totals per attribute that we want to track and writes them to a new
topic, and that topic then goes through a sink to either a DB or
something like S3, which we then read into our external DBs?
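
Roughly, I'm imagining the middle piece looking something like the sketch
below (written against a recent Kafka Streams API, just to show the
shape).  Topic names, serdes, the broker address, and the "one
attribute/value pair per record" layout are all placeholders; the real
Impression record would be Avro and we'd flat-map each attribute out of
it first:

import java.time.Duration;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Consumed;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Produced;
import org.apache.kafka.streams.kstream.TimeWindows;

public class DailyAttributeCounts {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "impression-daily-counts");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker1:9092");

        StreamsBuilder builder = new StreamsBuilder();
        // Assumed input: key = attribute name, value = attribute value.
        KStream<String, String> impressions =
            builder.stream("impressions", Consumed.with(Serdes.String(), Serdes.String()));

        impressions
            .groupBy((attribute, value) -> attribute + "|" + value,
                     Grouped.with(Serdes.String(), Serdes.String()))
            .windowedBy(TimeWindows.of(Duration.ofDays(1)))  // daily tumbling window
            .count()
            .toStream()
            // Flatten the windowed key so the output topic is plain string -> long.
            .map((windowedKey, count) ->
                KeyValue.pair(windowedKey.key() + "@" + windowedKey.window().start(), count))
            .to("daily-attribute-counts", Produced.with(Serdes.String(), Serdes.Long()));

        new KafkaStreams(builder.build(), props).start();
    }
}

From the output topic, a Kafka Connect sink (e.g. the S3 or JDBC sink
connectors) or a small consumer of our own could move the day's totals
into the external DBs, instead of reading the state stores directly.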

Thanks!

On Sun, Mar 4, 2018 at 12:31 AM, Thakrar, Jayesh <
jthak...@conversantmedia.com> wrote:

> Matt,
>
> If I understand correctly, you have an 8 node Kafka cluster and need to
> support  about 1 million requests/sec into the cluster from source servers
> and expect to consume that for aggregation.
>
> How big are your msgs?
>
> I would suggest looking into batching multiple requests per single Kafka
> msg to achieve desired throughput.
>
> So e.g. on the request receiving systems, I would suggest creating a
> logical avro file (byte buffer) of say N requests and then making that into
> one Kafka msg payload.
>
> We have a similar situation
> (https://www.slideshare.net/JayeshThakrar/apacheconflumekafka2016) and
> found anything from 4x to 10x better throughput with batching as compared
> to one request per msg.
> We have different kinds of msgs/topics and the individual "request" size
> varies from  about 100 bytes to 1+ KB.
>
> On 3/2/18, 8:24 AM, "Matt Daum" <m...@setfive.com> wrote:
>
>     I am new to Kafka but I think I have a good use case for it.  I am
>     trying to build daily counts of requests based on a number of different
>     attributes in a high throughput system (~1 million requests/sec. across
>     all 8 servers).  The different attributes are unbounded in terms of
>     values, and some will spread across 100's of millions of values.  This
>     is my current thought process, let me know where I could be more
>     efficient or if there is a better way to do it.
>
>     I'll create an Avro object "Impression" which has all the attributes of
>     the inbound request.  My application servers will then, on each request,
>     create and send this to a single Kafka topic.
>
>     I'll then have a consumer which creates a stream from the topic.  From
>     there I'll use the windowed timeframes and groupBy to group by the
>     attributes on each given day.  At the end of the day I'd need to read
>     out the data store to an external system for storage.  Since I won't
>     know all the values, I'd need something similar to KVStore.all() but
>     for windowed KV stores.  It appears this will be possible in 1.1 with
>     this commit:
>     https://github.com/apache/kafka/commit/1d1c8575961bf6bce7decb049be7f10ca76bd0c5
>
>     Is this the best approach?  Or would I be better off using the stream
>     as a listener and an external DB like Aerospike to store the counts,
>     then reading out of it directly at the end of the day?
>
>     Thanks for the help!
>     Daum
>