After checking again: each pod uses about 200–400 millicores and 2.2 GB. On Wed, Jan 29, 2025 at 21:41, Guillermo Ortiz Fernández (< guillermo.ortiz.f...@gmail.com>) wrote:
> I have a job written entirely in Flink SQL. The first part of the program processes 10 input topics and generates one output topic with normalized messages and some filtering applied (really simple: some WHERE clauses on fields and substrings). Nine of the topics produce between hundreds and thousands of messages per second, with an average of 4–10 partitions each. The other topic produces 150K messages per second and has 500 partitions. They are unioned into the output topic.
>
> The average output rate needed to avoid lag after filtering should be around 60K messages per second. I've been testing different configurations of parallelism, slots, and pods (everything runs on Kubernetes), but I'm far from reaching those numbers.
>
> In the latest configuration, I used 20 pods and a parallelism of 120, with 4 slots per taskmanager. With this setup I achieve approximately 20K messages per second, but I'm unable to consume the largest topic at the rate messages are being produced. Additionally, setting parallelism to 120 creates hundreds of subtasks for the smaller topics, which don't do much but still consume some resources even while idle.
>
> I started with parallelism 12 and got about 1,000–3,000 messages per second.
>
> When I check the CPU and memory usage of the pods, I don't see any problem: they are far from their limits. Each taskmanager has 4 GB and 2 CPUs, and they never come close to using all the CPU.
>
> It's a requirement to do everything 100% in SQL. How can I improve the throughput? Should I be concerned about the hundreds of subtasks created for the smaller topics?
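For reference, the kind of per-source override being asked about can be sketched in pure SQL. This is a minimal sketch, not the poster's actual DDL: the table names, topics, columns, and broker address are placeholders, and the `'scan.parallelism'` source option (FLIP-367) only exists in recent Flink Kafka connector releases, so verify it against your connector version before relying on it. The sink-side `'sink.parallelism'` option (FLIP-146) has been available much longer.

```sql
-- Global default parallelism for the job (standard Flink option).
SET 'parallelism.default' = '120';

-- Hypothetical source table for the 500-partition topic. Giving it a
-- high scan parallelism while leaving the small topics at a low default
-- avoids creating hundreds of mostly idle subtasks for the small sources.
-- 'scan.parallelism' is only in recent Kafka connector versions.
CREATE TABLE big_topic (
  id STRING,
  payload STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'big-topic',
  'properties.bootstrap.servers' = 'kafka:9092',
  'properties.group.id' = 'normalizer',
  'scan.startup.mode' = 'group-offsets',
  'format' = 'json',
  'scan.parallelism' = '120'
);

-- The sink side supports an explicit parallelism override (FLIP-146).
CREATE TABLE normalized_out (
  id STRING,
  payload STRING
) WITH (
  'connector' = 'kafka',
  'topic' = 'normalized',
  'properties.bootstrap.servers' = 'kafka:9092',
  'format' = 'json',
  'sink.parallelism' = '120'
);
```

If the connector version in use does not support `'scan.parallelism'`, the fallback is splitting the job so the large topic runs with its own `parallelism.default`.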