Hello,

I'm writing an application that generates reports from a real-time stream of
sale records (product, quantity, price, time) that I receive over a
websocket, and I plan to use Kafka and Flink to do so. The output of the
pipeline should be, for instance, "the average price for a certain product
over the last 3 hours, minute by minute", "the volume (sum of quantities)
for a certain product over the last hour, second by second", etc.

I want to balance the load across a cluster, given that I'm receiving far
too many records to keep in memory on a single machine, around 10,000 per
second. I intend to distribute them over a Kafka topic "sales" with
multiple partitions (one for each product). That way all the records for a
given product are kept on the same node, while different products are
processed on different nodes.
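
For reference, this is roughly how I plan to publish the records, using the
product as the message key so that Kafka's default partitioner sends all
records for one product to the same partition (the record format and class
name here are just for illustration):

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class SalesPublisher {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

            KafkaProducer<String, String> producer = new KafkaProducer<>(props);

            // Using the product as the key means the default partitioner
            // hashes it, so all records for one product land in one partition.
            String product = "ACME";
            String record = product + ",5,12.34," + System.currentTimeMillis();
            producer.send(new ProducerRecord<>("sales", product, record));

            producer.close();
        }
    }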

There are some things I still don't get about how Kafka and Flink are
supposed to work together. The following questions may be rather general,
and that's because I have no experience with either of them, so any
additional insight or advice you can provide is very welcome.

*1. Architecture*

Keeping all the records for a given product on the same node, as described
above, should lower latency since each consumer would handle only local
data. That works in theory only, though, because if I'm not mistaken Flink
executes its operators on its own cluster of nodes.

How does Flink know where to process what? What happens if there are more
nodes running Flink than Kafka, or the other way around? If I run one Kafka
broker and one Flink worker on each node of the cluster (on AWS), how can I
be sure that each node gets the correct data (i.e., only local data)?

*2. Performance*

Take the case of computing the volume of a product over a sliding window,
and suppose the quantities are 10, 3, 6, 1, 0, 15, .... For the first
window I may need to compute 10+3+6+1, for the second 3+6+1+0, for the
third 6+1+0+15, and so on. The problem is that each window recomputes some
of the partial sums (6+1, for instance, is computed three times), and it
gets even worse considering that a window may contain thousands of records
and that some operations are not that simple (standard deviation, for
instance).

Is there a way to reuse the results of previous operations to avoid this
problem? Are there any performance improvements that apply in these cases?
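
What I have in mind is something like keeping a running aggregate and
updating it as records enter and leave the window, instead of recomputing
everything from scratch. A plain-Java sketch of the idea, outside any
framework and simplified to a count-based window:

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Rolling sum over a sliding window: update incrementally instead of
    // recomputing the whole sum on every slide.
    public class RollingSum {
        private final Deque<Long> window = new ArrayDeque<>();
        private final int size;
        private long sum = 0;

        public RollingSum(int size) { this.size = size; }

        public long add(long quantity) {
            window.addLast(quantity);
            sum += quantity;                  // add the new record
            if (window.size() > size) {
                sum -= window.removeFirst();  // evict the oldest record
            }
            return sum;
        }
    }

With the quantities above and a window of 4, the second window's sum comes
out as 20 + 0 - 10 = 10 instead of recomputing 3+6+1+0. I don't know
whether Flink's window operators already do something like this internally,
hence the question.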

*3. Topology*

The pipeline of operations needed to gather all the information for a
report is quite large: it may require an average of prices, a volume, a
standard deviation, etc. Some of these operations can be executed
concurrently.

How do I define the workflow (topology) so that certain "steps" are
executed concurrently, while others wait for all the previous steps to be
completed before proceeding?
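
To make the question concrete, this is roughly what I imagine, based on my
reading of the DataStream API docs: one keyed stream feeding several
independent window aggregations that Flink would, I hope, run concurrently.
The SaleRecord POJO and the SaleRecordSchema deserializer are hypothetical
placeholders of mine, and I'm not sure this is the idiomatic way:

    import java.util.Properties;

    import org.apache.flink.api.java.tuple.Tuple;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.datastream.KeyedStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.windowing.WindowFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.streaming.api.windowing.windows.TimeWindow;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer09;
    import org.apache.flink.util.Collector;

    public class SalesReportJob {

        // Hypothetical POJO matching the records described above.
        public static class SaleRecord {
            public String product;
            public int quantity;
            public double price;
            public long time;
        }

        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

            Properties props = new Properties();
            props.setProperty("bootstrap.servers", "localhost:9092");
            props.setProperty("group.id", "reports");

            // SaleRecordSchema would be a DeserializationSchema I still
            // have to write; it's just a placeholder here.
            DataStream<SaleRecord> sales = env.addSource(
                new FlinkKafkaConsumer09<>("sales", new SaleRecordSchema(), props));

            KeyedStream<SaleRecord, Tuple> byProduct = sales.keyBy("product");

            // Volume: sum of quantities over the last hour, updated every second.
            byProduct
                .timeWindow(Time.hours(1), Time.seconds(1))
                .sum("quantity")
                .print();

            // Average price over the last 3 hours, updated every minute.
            // This branch is independent of the volume branch, so I expect
            // Flink to execute both concurrently on the same stream.
            byProduct
                .timeWindow(Time.hours(3), Time.minutes(1))
                .apply(new WindowFunction<SaleRecord, Double, Tuple, TimeWindow>() {
                    @Override
                    public void apply(Tuple key, TimeWindow window,
                                      Iterable<SaleRecord> records,
                                      Collector<Double> out) {
                        double sum = 0;
                        long count = 0;
                        for (SaleRecord r : records) {
                            sum += r.price;
                            count++;
                        }
                        if (count > 0) {
                            out.collect(sum / count);
                        }
                    }
                })
                .print();

            env.execute("sales reports");
        }
    }

What I don't know is how to express the "wait for previous steps" part,
i.e., a step that must see the results of several of these branches before
it can proceed.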

*4. Launch*

The producer and all the consumers share many Java classes, so for
simplicity I intend to create and launch them from a single
application/process, perhaps creating one thread for each if required.

Is there any problem with that? Is there any advantage to creating an
independent application for each producer and consumer, as shown in the
documentation?
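
That is, something like the following, where SalesProducer and
ReportConsumer are hypothetical Runnable wrappers of mine around the
producer and consumer code, all sharing the same classes on one classpath:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class Launcher {
        public static void main(String[] args) {
            ExecutorService pool = Executors.newCachedThreadPool();

            // Hypothetical Runnable wrappers around my producer and
            // consumers, all launched from this single process.
            pool.submit(new SalesProducer());
            pool.submit(new ReportConsumer("average-price"));
            pool.submit(new ReportConsumer("volume"));
        }
    }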

-

Best regards,
Matt
