I have a job written entirely in Flink SQL. The first part of the program
processes 10 input topics and generates one output topic with normalized
messages and some filtering applied (really simple: a few WHERE clauses on
fields plus substring matches). Nine of the topics produce between hundreds
and thousands of messages per second and have 4–10 partitions each. The
other topic produces 150K messages per second and has 500 partitions. All
of them are unioned into the output topic.
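
To make the shape of the job concrete, here is a simplified sketch of what
it does (the table names, fields, and connector options below are
placeholders, not the real DDL):

  -- One of the ten sources; the nine smaller ones are defined the same
  -- way, with 4-10 partitions each.
  CREATE TABLE big_topic (
    id STRING,
    category STRING,
    payload STRING
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'big-topic',  -- 500 partitions, ~150K msg/s
    'properties.bootstrap.servers' = 'kafka:9092',
    'properties.group.id' = 'normalizer',
    'scan.startup.mode' = 'group-offsets',
    'format' = 'json'
  );

  CREATE TABLE normalized_out (
    id STRING,
    category STRING,
    payload STRING
  ) WITH (
    'connector' = 'kafka',
    'topic' = 'normalized-out',
    'properties.bootstrap.servers' = 'kafka:9092',
    'format' = 'json'
  );

  -- Normalize, filter on fields and substrings, and union everything
  -- into the single output topic.
  INSERT INTO normalized_out
  SELECT id, category, payload
  FROM big_topic
  WHERE category = 'x' AND payload LIKE '%needle%'
  UNION ALL
  SELECT id, category, payload
  FROM small_topic_1
  WHERE category = 'y';
  -- ...plus UNION ALL branches for the remaining small topics.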

After filtering, the output rate needed to avoid lag is around 60K
messages per second. I've been testing different combinations of
parallelism, slots, and pods (everything runs on Kubernetes), but I'm far
from reaching those numbers.

In the latest configuration I used 20 pods, a parallelism of 120, and 4
slots per TaskManager. With this setup I reach approximately 20K messages
per second, but I'm unable to consume the largest topic at the rate its
messages are produced. Additionally, setting parallelism to 120 creates
hundreds of subtasks for the smaller topics, which do very little but
still consume some resources even when idle.
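
For reference, the parallelism is set like this (SET statements in the SQL
client; the key names are from the Flink configuration docs, and whether a
given key takes effect from the client depends on the deployment mode):

  -- Session/job default parallelism.
  SET 'parallelism.default' = '120';

  -- Planner-level default; -1 means fall back to parallelism.default.
  SET 'table.exec.resource.default-parallelism' = '120';

  -- Slots per TaskManager are cluster config rather than a SQL setting:
  --   taskmanager.numberOfTaskSlots: 4   (flink-conf.yaml)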

I started with a parallelism of 12 and got about 1,000–3,000 messages per
second.

When I check the CPU and memory usage of the pods, I don't see any
problem: they are far from their limits. Each TaskManager has 4 GB and 2
CPUs, and they never come close to saturating the CPU.

It's a requirement to do everything 100% in SQL. How can I improve the
throughput? And should I be concerned about the hundreds of subtasks
created for the smaller topics?
