Have you checked memory usage? It could be as simple as a memory leak, or aggregating more state than you expect (it is often not obvious how much is kept in memory for longer than one first thinks). If possible, connect Flight Recorder or a similar tool and keep an eye on heap usage over the course of the run. Additionally, I don't have much AWS experience to speak of, but IF the EC2 instance swaps RAM to disk like a regular Linux box, that could kick in once your JVM heap grows beyond what fits in the available RAM, which would match the sudden slowdown you describe.
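As a rough sketch, something along these lines in flink-conf.yaml should start a continuous flight recording for the task manager JVM that you can open in Java Mission Control afterwards (this assumes an Oracle JDK 8; the unlock flag is not needed on newer JDKs, and the output path is just a placeholder):

  env.java.opts: -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:StartFlightRecording=dumponexit=true,filename=/tmp/flink.jfr

And to see whether the box is actually swapping during the run, plain Linux tools are enough:

  free -m      # check the Swap line for used swap
  vmstat 5     # non-zero si/so columns mean pages are moving to/from swap

If the slowdown lines up with swap activity or with the heap approaching its limit in the recording, that would point to memory rather than Flink/Kafka itself.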
On Tue, Apr 17, 2018 at 9:26 AM, TechnoMage <mla...@technomage.com> wrote:

> I am doing a short proof of concept for using Flink and Kafka in our
> product. On my laptop I can process 10M inputs in about 90 min. On 2
> different EC2 instances (m4.xlarge and m5.xlarge, both 4 cores, 16 GB RAM,
> and SSD storage) I see the process hit a wall around 50 min into the test,
> short of 7M events processed. This is running Zookeeper, the Kafka broker,
> and Flink all on the same server in all cases. My goal is to measure
> single-node vs. multi-node and test horizontal scalability, but I would
> like to figure out why it hits a wall first. I have the task manager
> configured with 6 slots and the job has a parallelism of 5. The laptop has
> 8 threads, and the EC2 instances have 4 threads. On smaller data sets and
> in the beginning of each test the EC2 instances outpace the laptop. I will
> try again with an m5.2xlarge, which has 8 threads and 32 GB RAM, to see if
> that works better for this workload. Any pointers or ways to get metrics
> that would help diagnose this would be appreciated.
>
> Michael
>

--
Niclas Hedhman, Software Developer
http://zest.apache.org - New Energy for Java