Hello community,
My team is developing an application with PyFlink, using the DataStream API.
Basically, we read from a Kafka topic, apply a few maps, and write to another
Kafka topic. One restriction is that the first map has to process records
serially, so it runs with parallelism equal to one. This is causing a
throughput bottleneck: we are only achieving approximately 2k msgs/sec.
Monitoring the CPU usage and the number of records in each operator, it looks
like this first operator is the problem.
The first operator is essentially a buffer: it groups the messages coming from
Kafka and sends them on to the next operators. For the buffering we use a
deque from Python's collections module. Since we are stuck on this issue,
could you help us with a few questions?
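
For context, the first operator is roughly shaped like the sketch below
(simplified: the class name, the batch size and the flush-on-size condition
are illustrative, not our exact code, and the buffer is not checkpointed):

from collections import deque

from pyflink.datastream import FlatMapFunction


class MessageBuffer(FlatMapFunction):
    # Groups incoming messages into batches of `batch_size` using an
    # in-memory deque; since it is not keyed, it must run with parallelism 1.
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = deque()

    def flat_map(self, value):
        self.buffer.append(value)
        if len(self.buffer) >= self.batch_size:
            # Emit the whole group downstream and start a new one.
            yield list(self.buffer)
            self.buffer.clear()


# ds = ds.flat_map(MessageBuffer(100)).set_parallelism(1)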

1 - Can using plain Python data structures (such as the deque) inside an
operator introduce latency or increase CPU usage?
2 - Are there alternatives to this approach? We were thinking about Flink's
window mechanism, but in our case the grouping is not time based, and we did
not find an equivalent in the Python API (a sketch of what we have in mind is
after this list).
3 - Could using the Table API to read from the Kafka topic and do the
windowing there improve our performance?
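
Regarding question 2, assuming the grouping is by number of messages, what we
were imagining (not our actual code; the names, batch size and the String
element type are assumptions) is a KeyedProcessFunction that buffers records
in list state and emits a batch every N elements. Keying by a constant should
keep the serialized, parallelism-one behaviour of the current map:

from pyflink.common import Types
from pyflink.datastream import KeyedProcessFunction, RuntimeContext
from pyflink.datastream.state import ListStateDescriptor, ValueStateDescriptor


class CountBatcher(KeyedProcessFunction):
    # Buffers records in keyed state and emits a batch every `batch_size` elements.
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.buffer = None
        self.count = None

    def open(self, runtime_context: RuntimeContext):
        self.buffer = runtime_context.get_list_state(
            ListStateDescriptor("buffer", Types.STRING()))
        self.count = runtime_context.get_state(
            ValueStateDescriptor("count", Types.INT()))

    def process_element(self, value, ctx):
        self.buffer.add(value)
        seen = (self.count.value() or 0) + 1
        if seen >= self.batch_size:
            # Emit the full batch downstream and reset the state.
            yield list(self.buffer.get())
            self.buffer.clear()
            self.count.clear()
        else:
            self.count.update(seen)


# ds.key_by(lambda msg: 0).process(CountBatcher(100))

Regarding question 3, what we would try is defining the Kafka source with SQL
DDL and doing the grouping on the Table side, roughly like this (topic,
broker, group id and format are placeholders, not our real settings):

from pyflink.table import EnvironmentSettings, TableEnvironment

t_env = TableEnvironment.create(EnvironmentSettings.in_streaming_mode())
t_env.execute_sql("""
    CREATE TABLE input_topic (
        payload STRING
    ) WITH (
        'connector' = 'kafka',
        'topic' = 'input-topic',
        'properties.bootstrap.servers' = 'broker:9092',
        'properties.group.id' = 'our-group',
        'scan.startup.mode' = 'latest-offset',
        'format' = 'raw'
    )
""")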

We have already tuned some parameters, such as python.fn-execution.bundle.time
and buffer.timeout, to try to improve performance.
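
In case the exact settings matter, this is roughly how we set them (the
values here are illustrative, not our production ones):

from pyflink.common import Configuration
from pyflink.datastream import StreamExecutionEnvironment

config = Configuration()
# How long / how many elements the Python worker buffers before flushing
# results back to the operator.
config.set_integer("python.fn-execution.bundle.size", 10000)
config.set_integer("python.fn-execution.bundle.time", 1000)  # milliseconds

env = StreamExecutionEnvironment.get_execution_environment(config)
# Network buffer flush interval in milliseconds (buffer.timeout).
env.set_buffer_timeout(100)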

Thanks for your attention.
Best Regards
