Going through the release notes today, we tried fiddling with the taskmanager.memory.fraction option, going as low as 0.1, unfortunately without success. The container still runs beyond its physical memory limits.
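For reference, the settings in play here live in flink-conf.yaml and look roughly like the below - the values are illustrative rather than our exact configuration, and the last two options are only shown at their 1.9 defaults for context:

taskmanager.heap.size: 12g                 # TaskManager memory; 12g here only to mirror the 12 GB container in the log below
taskmanager.memory.fraction: 0.1           # share of free memory given to managed memory for sorting/hashing - the value we lowered to 0.1
taskmanager.network.memory.fraction: 0.1   # share reserved for network buffers (default)
containerized.heap-cutoff-ratio: 0.25      # headroom YARN subtracts from the container for off-heap usage (default)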
// ah

From: Hailu, Andreas [Engineering]
Sent: Tuesday, November 19, 2019 6:01 PM
To: 'user@flink.apache.org' <user@flink.apache.org>
Subject: CoGroup SortMerger performance degradation from 1.6.4 - 1.9.1?

Hi,

We're in the middle of testing the upgrade of our data processing flows from Flink 1.6.4 to 1.9.1. We're seeing that flows which were running just fine on 1.6.4 now fail on 1.9.1 with the same application resources and input data size. It seems that there have been some changes around how the data is sorted prior to being fed to the CoGroup operator - this is the error that we encounter:

Caused by: org.apache.flink.runtime.client.JobExecutionException: Job execution failed.
        at org.apache.flink.runtime.jobmaster.JobResult.toJobExecutionResult(JobResult.java:146)
        at org.apache.flink.client.program.rest.RestClusterClient.submitJob(RestClusterClient.java:259)
        ... 15 more
Caused by: java.lang.Exception: The data preparation for task 'CoGroup (Dataset | Merge | NONE)' , caused an error: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Lost connection to task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that the remote task manager was lost.
        at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:480)
        at org.apache.flink.runtime.operators.BatchTask.invoke(BatchTask.java:369)
        at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
        at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
        ... 1 more
Caused by: java.lang.RuntimeException: Error obtaining the sorted input: Thread 'SortMerger Reading Thread' terminated due to an exception: Lost connection to task manager 'd73996-213.dc.gs.com/10.47.226.218:46003'. This indicates that the remote task manager was lost.
        at org.apache.flink.runtime.operators.sort.UnilateralSortMerger.getIterator(UnilateralSortMerger.java:650)
        at org.apache.flink.runtime.operators.BatchTask.getInput(BatchTask.java:1109)
        at org.apache.flink.runtime.operators.CoGroupDriver.prepare(CoGroupDriver.java:102)
        at org.apache.flink.runtime.operators.BatchTask.run(BatchTask.java:474)

I drilled further down into the YARN app logs, and I found that the container was running out of physical memory:

2019-11-19 12:49:23,068 INFO org.apache.flink.yarn.YarnResourceManager - Closing TaskExecutor connection container_e42_1574076744505_9444_01_000004 because: Container [pid=42774,containerID=container_e42_1574076744505_9444_01_000004] is running beyond physical memory limits. Current usage: 12.0 GB of 12 GB physical memory used; 13.9 GB of 25.2 GB virtual memory used. Killing container.

This is what raises my suspicion, as this resource configuration worked just fine on 1.6.4. I'm working on getting heap dumps of these applications myself to try to get a better understanding of what's causing the blow-up in required physical memory, but it would be helpful if anyone knew what relevant changes have been made between these versions, or where else I could look. There are some features in 1.9 that we'd like to use in our flows, and this problem is inhibiting us from doing so, so getting it sorted out - no pun intended - is important to us.

Best,
Andreas
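In case it helps frame the stack trace above, here is a minimal sketch of the kind of two-input DataSet coGroup that goes through that sorted data-preparation step - the tuple types, keys, and class name are placeholders, not our actual flow:

// Minimal sketch only; placeholder types and keys, not the real job.
import org.apache.flink.api.common.functions.CoGroupFunction;
import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.util.Collector;

public class CoGroupSketch {
    public static void main(String[] args) throws Exception {
        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        DataSet<Tuple2<String, Integer>> left =
                env.fromElements(Tuple2.of("a", 1), Tuple2.of("b", 2));
        DataSet<Tuple2<String, Long>> right =
                env.fromElements(Tuple2.of("a", 10L), Tuple2.of("c", 30L));

        // Both inputs are sorted on the key before the CoGroupFunction is invoked;
        // that sort is the work the 'SortMerger Reading Thread' is doing when the
        // connection to the remote task manager is lost.
        left.coGroup(right)
            .where(0)
            .equalTo(0)
            .with(new CoGroupFunction<Tuple2<String, Integer>, Tuple2<String, Long>, String>() {
                @Override
                public void coGroup(Iterable<Tuple2<String, Integer>> first,
                                    Iterable<Tuple2<String, Long>> second,
                                    Collector<String> out) {
                    int l = 0;
                    int r = 0;
                    for (Tuple2<String, Integer> ignored : first) { l++; }
                    for (Tuple2<String, Long> ignored : second) { r++; }
                    out.collect(l + " left / " + r + " right records for this key");
                }
            })
            .print();
    }
}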