Hi,

Is it possible to benchmark the first map in isolation (rather than the
whole job)? That would give you a basic idea of whether the map
implementation itself is the problem. Besides the map implementation, the
framework introduces some overhead of its own, e.g. the communication
between the Java and Python processes and serialization/deserialization.
This overhead depends on the message size, the type of elements, etc. One
way to get a basic sense of it is to run a map that does nothing and see
what throughput you can achieve in that case.
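To illustrate the first suggestion, here is a minimal sketch of benchmarking
a map function in isolation, outside the Flink job. The BufferingMap class
is a hypothetical stand-in for your first operator (I'm assuming it groups
messages in a collections.deque, as you described); the names and batch size
are made up for the example.

```python
import time
from collections import deque


class BufferingMap:
    """Hypothetical stand-in for the first operator: buffers messages
    in a deque and emits them as a batch once batch_size is reached."""

    def __init__(self, batch_size=100):
        self.buffer = deque()
        self.batch_size = batch_size

    def map(self, msg):
        self.buffer.append(msg)
        if len(self.buffer) >= self.batch_size:
            batch = list(self.buffer)
            self.buffer.clear()
            return batch
        return None


def benchmark(map_fn, n=100_000):
    """Push n synthetic messages through map_fn and return msgs/sec."""
    start = time.perf_counter()
    for i in range(n):
        map_fn(f"msg-{i}")
    elapsed = time.perf_counter() - start
    return n / elapsed


if __name__ == "__main__":
    rate = benchmark(BufferingMap().map)
    print(f"{rate:,.0f} msgs/sec through the map function alone")
```

If the standalone rate is far above 2k msgs/sec, the bottleneck is more
likely the framework overhead (process communication, serialization) than
the deque-based buffering itself.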

Regards,
Dian


On Wed, Nov 17, 2021 at 11:14 PM Thomas Portugal <thomasportug...@gmail.com>
wrote:

> Hello community,
> My team is developing an application using PyFlink, with the DataStream
> API. Basically, we read from a Kafka topic, apply some maps, and write to
> another Kafka topic. One restriction is that the first map has to run
> serialized, with a parallelism of one. This is causing a throughput
> bottleneck: we are achieving approximately 2k msgs/sec. Monitoring the
> CPU usage and the number of records on each operator, it seems that the
> first operator is causing the issue.
> The first operator acts like a buffer that groups the messages from Kafka
> and sends them to the next operators. We are using a deque from Python's
> collections module. Since we are stuck on this issue, could you answer
> some questions about this matter?
>
> 1 - Can using data structures from Python introduce latency or increase
> CPU usage?
> 2 - Are there alternatives to this approach? We were thinking about
> Flink's Window mechanism, but in our case it isn't time based, and we
> didn't find an equivalent in the Python API.
> 3 - Would using the Table API to read from the Kafka topic and do the
> windowing improve our performance?
>
> We have already set some parameters, like python.fn-execution.bundle.time
> and buffer.timeout, to improve our performance.
>
> Thanks for your attention.
> Best Regards
>