A few ideas on how to start debugging this:

- Try deactivating checkpoints. With checkpointing disabled, no work goes
  into persisting RocksDB data to the checkpoint store.
- Try swapping RocksDB for the FsStateBackend - that removes the
  serialization cost of moving data between the heap and off-heap memory
  (RocksDB).
- Do you have expensive types (JSON, etc.)? Try activating object reuse,
  which avoids some extra defensive copies. (See the sketch below.)
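As a rough sketch only (this assumes the standard DataStream API; the class
name, the FsStateBackend path, and the commented-out checkpoint interval are
placeholders, not values from your job):

import org.apache.flink.runtime.state.filesystem.FsStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class PocJobSetup {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // 1) Deactivate checkpoints for the test run: simply do not call
        //    enableCheckpointing(...).
        // env.enableCheckpointing(60_000);

        // 2) Use the heap-based FsStateBackend instead of RocksDB
        //    (placeholder checkpoint path).
        env.setStateBackend(new FsStateBackend("file:///tmp/flink-checkpoints"));

        // 3) Enable object reuse to avoid some extra defensive copies of records.
        env.getConfig().enableObjectReuse();

        // ... build the topology and call env.execute() as before ...
    }
}

If the slowdown disappears with the FsStateBackend, that points at RocksDB
serialization/IO; if it disappears with checkpoints off, it points at
checkpointing cost.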
On Tue, Apr 17, 2018 at 5:50 PM, TechnoMage <mla...@technomage.com> wrote:

> Memory use is steady throughout the job, but the CPU utilization drops off
> a cliff. I assume this is because it becomes I/O bound shuffling managed
> state.
>
> Are there any metrics on managed state that can help in evaluating what to
> do next?
>
> Michael
>
>
> On Apr 17, 2018, at 7:11 AM, Michael Latta <mla...@technomage.com> wrote:
>
> Thanks for the suggestion. The task manager is configured for 8GB of heap
> and gets to about 8.3GB total; other Java processes (job manager and Kafka)
> add a few more. I will check it again, but the instances have 16GB, the
> same as my laptop, which completes the test in <90 min.
>
> Michael
>
> Sent from my iPad
>
> On Apr 16, 2018, at 10:53 PM, Niclas Hedhman <nic...@hedhman.org> wrote:
>
> Have you checked memory usage? It could be as simple as either having
> memory leaks, or aggregating more than you think (it is sometimes not
> obvious how much is kept around in memory for longer than one first
> thinks). If possible, connect FlightRecorder or a similar tool and keep an
> eye on memory. Additionally, I don't have AWS experience to speak of, but
> IF AWS swaps RAM to disk like regular Linux, then that might be triggered
> if your JVM heap is bigger than can be handled within the available RAM.
>
> On Tue, Apr 17, 2018 at 9:26 AM, TechnoMage <mla...@technomage.com> wrote:
>
>> I am doing a short proof of concept for using Flink and Kafka in our
>> product. On my laptop I can process 10M inputs in about 90 min. On two
>> different EC2 instances (m4.xlarge and m5.xlarge, both 4 cores, 16GB RAM,
>> and SSD storage) I see the process hit a wall around 50 min into the test,
>> short of 7M events processed. This is running ZooKeeper, the Kafka broker,
>> and Flink all on the same server in all cases. My goal is to measure
>> single-node vs. multi-node performance and test horizontal scalability,
>> but I would like to figure out why it hits a wall first. I have the task
>> manager configured with 6 slots and the job has a parallelism of 5. The
>> laptop has 8 threads, and the EC2 instances have 4 threads. On smaller
>> data sets and at the beginning of each test the EC2 instances outpace the
>> laptop. I will try again with an m5.2xlarge, which has 8 threads and 32GB
>> RAM, to see if that works better for this workload. Any pointers or ways
>> to get metrics that would help diagnose this would be appreciated.
>>
>> Michael
>
>
> --
> Niclas Hedhman, Software Developer
> http://polygene.apache.org - New Energy for Java