I am doing a short proof of concept for using Flink and Kafka in our product. On my laptop I can process 10M inputs in about 90 min. On two different EC2 instance types (m4.xlarge and m5.xlarge, both with 4 cores, 16 GB RAM, and SSD storage) the process hits a wall around 50 min into the test, just short of 7M events processed. In all cases ZooKeeper, the Kafka broker, and Flink run on the same server.

My goal is to measure single-node vs. multi-node performance and test horizontal scalability, but first I would like to figure out why it hits a wall. The task manager is configured with 6 slots and the job runs with a parallelism of 5. The laptop has 8 threads, and the EC2 instances have 4 threads. On smaller data sets, and at the beginning of each test, the EC2 instances outpace the laptop.

I will try again with an m5.2xlarge, which has 8 threads and 32 GB RAM, to see if that works better for this workload. Any pointers, or ways to get metrics that would help diagnose this, would be appreciated.
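
For reference, here is a rough back-of-the-envelope comparison of the average throughput implied by the numbers above, as a small Python sketch. The helper name is made up for illustration, and the EC2 figure is only an upper bound, since the run stalled just short of 7M events.

```python
def events_per_second(events: int, minutes: float) -> float:
    """Average throughput over a run, in events per second."""
    return events / (minutes * 60)

# Laptop: 10M inputs in about 90 min (full run).
laptop_rate = events_per_second(10_000_000, 90)

# EC2: just short of 7M events at roughly 50 min, so this is an upper bound
# on the average rate before the stall.
ec2_rate_upper = events_per_second(7_000_000, 50)

print(f"laptop average:   ~{laptop_rate:.0f} events/s")            # ~1852
print(f"EC2 before stall: up to ~{ec2_rate_upper:.0f} events/s")   # ~2333
```

So the EC2 instances are averaging a somewhat higher rate than the laptop right up until the point where they stall, which matches what I see on the smaller data sets.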
Michael