[ https://issues.apache.org/jira/browse/FLINK-16142?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17041683#comment-17041683 ]
Thomas Wozniakowski commented on FLINK-16142:
---------------------------------------------

Hi [~sewen], here is the first chunk of the logs with all the config parts:

{code}
Starting Task Manager
config file:
jobmanager.rpc.address: pattern-detector-e2e-jobmanager
jobmanager.rpc.port: 6123
jobmanager.heap.size: 1024m
taskmanager.memory.process.size: 1568m
taskmanager.numberOfTaskSlots: 2
parallelism.default: 1
jobmanager.execution.failover-strategy: region
blob.server.port: 6124
query.server.port: 6125
Starting taskexecutor as a console application on host 1ef836eff98e.
2020-02-21 08:46:50,418 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2020-02-21 08:46:50,422 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Preconfiguration:
2020-02-21 08:46:50,423 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner -

TM_RESOURCES_JVM_PARAMS extraction logs:
 - Loading configuration property: jobmanager.rpc.address, pattern-detector-e2e-jobmanager
 - Loading configuration property: jobmanager.rpc.port, 6123
 - Loading configuration property: jobmanager.heap.size, 1024m
 - Loading configuration property: taskmanager.memory.process.size, 1568m
 - Loading configuration property: taskmanager.numberOfTaskSlots, 2
 - Loading configuration property: parallelism.default, 1
 - Loading configuration property: jobmanager.execution.failover-strategy, region
 - Loading configuration property: blob.server.port, 6124
 - Loading configuration property: query.server.port, 6125
 - The derived from fraction jvm overhead memory (156.800mb (164416719 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
BASH_JAVA_UTILS_EXEC_RESULT:-Xmx536870902 -Xms536870902 -XX:MaxDirectMemorySize=268435458 -XX:MaxMetaspaceSize=100663296

TM_RESOURCES_DYNAMIC_CONFIGS extraction logs:
 - Loading configuration property: jobmanager.rpc.address, pattern-detector-e2e-jobmanager
 - Loading configuration property: jobmanager.rpc.port, 6123
 - Loading configuration property: jobmanager.heap.size, 1024m
 - Loading configuration property: taskmanager.memory.process.size, 1568m
 - Loading configuration property: taskmanager.numberOfTaskSlots, 2
 - Loading configuration property: parallelism.default, 1
 - Loading configuration property: jobmanager.execution.failover-strategy, region
 - Loading configuration property: blob.server.port, 6124
 - Loading configuration property: query.server.port, 6125
 - The derived from fraction jvm overhead memory (156.800mb (164416719 bytes)) is less than its min value 192.000mb (201326592 bytes), min value will be used instead
BASH_JAVA_UTILS_EXEC_RESULT:-D taskmanager.memory.framework.off-heap.size=134217728b -D taskmanager.memory.network.max=134217730b -D taskmanager.memory.network.min=134217730b -D taskmanager.memory.framework.heap.size=134217728b -D taskmanager.memory.managed.size=536870920b -D taskmanager.cpu.cores=2.0 -D taskmanager.memory.task.heap.size=402653174b -D taskmanager.memory.task.off-heap.size=0b

2020-02-21 08:46:50,423 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2020-02-21 08:46:50,424 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Starting TaskManager (Version: 1.10.0, Rev:aa4eb8f, Date:07.02.2020 @ 19:18:19 CET)
2020-02-21 08:46:50,425 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - OS current user: flink
2020-02-21 08:46:50,426 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Current Hadoop/Kerberos user: <no hadoop dependency found>
2020-02-21 08:46:50,426 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM: OpenJDK 64-Bit Server VM - Oracle Corporation - 1.8/25.242-b08
2020-02-21 08:46:50,426 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum heap size: 512 MiBytes
2020-02-21 08:46:50,427 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JAVA_HOME: /usr/local/openjdk-8
2020-02-21 08:46:50,427 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - No Hadoop Dependency available
2020-02-21 08:46:50,428 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - JVM Options:
2020-02-21 08:46:50,428 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:+UseG1GC
2020-02-21 08:46:50,428 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xmx536870902
2020-02-21 08:46:50,428 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Xms536870902
2020-02-21 08:46:50,429 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxDirectMemorySize=268435458
2020-02-21 08:46:50,429 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -XX:MaxMetaspaceSize=100663296
2020-02-21 08:46:50,429 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlog4j.configuration=file:/opt/flink/conf/log4j-console.properties
2020-02-21 08:46:50,429 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -Dlogback.configurationFile=file:/opt/flink/conf/logback-console.xml
2020-02-21 08:46:50,430 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Program Arguments:
2020-02-21 08:46:50,430 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --configDir
2020-02-21 08:46:50,430 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - /opt/flink/conf
2020-02-21 08:46:50,430 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -D
2020-02-21 08:46:50,431 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - taskmanager.memory.framework.off-heap.size=134217728b
2020-02-21 08:46:50,431 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -D
2020-02-21 08:46:50,431 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - taskmanager.memory.network.max=134217730b
2020-02-21 08:46:50,432 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -D
2020-02-21 08:46:50,432 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - taskmanager.memory.network.min=134217730b
2020-02-21 08:46:50,432 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -D
2020-02-21 08:46:50,432 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - taskmanager.memory.framework.heap.size=134217728b
2020-02-21 08:46:50,433 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -D
2020-02-21 08:46:50,433 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - taskmanager.memory.managed.size=536870920b
2020-02-21 08:46:50,433 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -D
2020-02-21 08:46:50,434 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - taskmanager.cpu.cores=2.0
2020-02-21 08:46:50,434 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -D
2020-02-21 08:46:50,434 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - taskmanager.memory.task.heap.size=402653174b
2020-02-21 08:46:50,434 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - -D
2020-02-21 08:46:50,435 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - taskmanager.memory.task.off-heap.size=0b
2020-02-21 08:46:50,435 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Classpath: /opt/flink/lib/flink-table-blink_2.11-1.10.0.jar:/opt/flink/lib/flink-table_2.11-1.10.0.jar:/opt/flink/lib/log4j-1.2.17.jar:/opt/flink/lib/slf4j-log4j12-1.7.15.jar:/opt/flink/lib/flink-dist_2.11-1.10.0.jar:::
2020-02-21 08:46:50,435 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - --------------------------------------------------------------------------------
2020-02-21 08:46:50,438 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Registered UNIX signal handlers for [TERM, HUP, INT]
2020-02-21 08:46:50,448 INFO org.apache.flink.runtime.taskexecutor.TaskManagerRunner - Maximum number of open file descriptors is 1048576.
2020-02-21 08:46:50,487 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.address, pattern-detector-e2e-jobmanager
2020-02-21 08:46:50,487 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.rpc.port, 6123
2020-02-21 08:46:50,487 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.heap.size, 1024m
2020-02-21 08:46:50,488 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.memory.process.size, 1568m
2020-02-21 08:46:50,489 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: taskmanager.numberOfTaskSlots, 2
2020-02-21 08:46:50,489 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: parallelism.default, 1
2020-02-21 08:46:50,490 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: jobmanager.execution.failover-strategy, region
2020-02-21 08:46:50,492 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: blob.server.port, 6124
2020-02-21 08:46:50,493 INFO org.apache.flink.configuration.GlobalConfiguration - Loading configuration property: query.server.port, 6125
{code}

> Memory Leak causes Metaspace OOM error on repeated job submission
> -----------------------------------------------------------------
>
>                 Key: FLINK-16142
>                 URL: https://issues.apache.org/jira/browse/FLINK-16142
>             Project: Flink
>          Issue Type: Bug
>          Components: Client / Job Submission
>    Affects Versions: 1.10.0
>            Reporter: Thomas Wozniakowski
>            Priority: Blocker
>             Fix For: 1.10.1, 1.11.0
>
>
> Hi Guys,
> We've just tried deploying 1.10.0 as it has lots of shiny stuff that fits our
> use-case exactly (RocksDB state backend running in a containerised cluster).
> Unfortunately, it seems like there is a memory leak somewhere in the job
> submission logic. We are getting this error:
> {code:java}
> 2020-02-18 10:22:10,020 INFO org.apache.flink.runtime.executiongraph.ExecutionGraph - OPERATOR_NAME switched from RUNNING to FAILED.
> java.lang.OutOfMemoryError: Metaspace
>     at java.lang.ClassLoader.defineClass1(Native Method)
>     at java.lang.ClassLoader.defineClass(ClassLoader.java:757)
>     at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
>     at java.net.URLClassLoader.defineClass(URLClassLoader.java:468)
>     at java.net.URLClassLoader.access$100(URLClassLoader.java:74)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:369)
>     at java.net.URLClassLoader$1.run(URLClassLoader.java:363)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at java.net.URLClassLoader.findClass(URLClassLoader.java:362)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
>     at org.apache.flink.util.ChildFirstClassLoader.loadClass(ChildFirstClassLoader.java:60)
>     at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.jmx.SdkMBeanRegistrySupport.registerMetricAdminMBean(SdkMBeanRegistrySupport.java:27)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.registerMetricAdminMBean(AwsSdkMetrics.java:398)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.metrics.AwsSdkMetrics.<clinit>(AwsSdkMetrics.java:359)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.requestMetricCollector(AmazonWebServiceClient.java:728)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRMCEnabledAtClientOrSdkLevel(AmazonWebServiceClient.java:660)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.isRequestMetricsEnabled(AmazonWebServiceClient.java:652)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:611)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.AmazonWebServiceClient.createExecutionContext(AmazonWebServiceClient.java:606)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.executeListShards(AmazonKinesisClient.java:1534)
>     at org.apache.flink.kinesis.shaded.com.amazonaws.services.kinesis.AmazonKinesisClient.listShards(AmazonKinesisClient.java:1528)
>     at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.listShards(KinesisProxy.java:439)
>     at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardsOfStream(KinesisProxy.java:389)
>     at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.getShardList(KinesisProxy.java:279)
>     at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.discoverNewShardsToSubscribe(KinesisDataFetcher.java:686)
>     at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:287)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:100)
>     at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:63)
> {code}
> (The only change in the above text is the OPERATOR_NAME text where I removed
> some of the internal specifics of our system).
> This will reliably happen on a fresh cluster after submitting and cancelling
> our job 3 times.
> We are using the presto-s3 plugin, the CEP library and the Kinesis connector.
> Please let me know what other diagnostics would be useful.
> Tom



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
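
For anyone reading the TaskManager log above and wondering where the JVM flags come from, here is a rough sketch of the arithmetic. It is not Flink's actual derivation code. The only value taken from the posted config is `taskmanager.memory.process.size: 1568m`; every other constant (fractions, min/max bounds, the 96 MiB metaspace size) is an assumption about the Flink 1.10.0 defaults, chosen because it reproduces the figures in the log. The function name `derive_memory` is made up for this illustration.

```python
from fractions import Fraction

MIB = 1024 * 1024

# ASSUMED Flink 1.10.0 defaults; none of these are set in the posted config.
JVM_METASPACE_MIB = 96
JVM_OVERHEAD_FRACTION = Fraction(1, 10)
JVM_OVERHEAD_MIN_MIB, JVM_OVERHEAD_MAX_MIB = 192, 1024
NETWORK_FRACTION = Fraction(1, 10)
NETWORK_MIN_MIB, NETWORK_MAX_MIB = 64, 1024
MANAGED_FRACTION = Fraction(4, 10)
FRAMEWORK_HEAP_MIB = 128
FRAMEWORK_OFF_HEAP_MIB = 128
TASK_OFF_HEAP_MIB = 0


def clamp(value, low, high):
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(value, high))


def derive_memory(process_mib):
    """Split a total process size (in MiB) into the components seen in the log."""
    # JVM overhead is a fraction of the process size, bounded by min and max.
    overhead = clamp(process_mib * JVM_OVERHEAD_FRACTION,
                     JVM_OVERHEAD_MIN_MIB, JVM_OVERHEAD_MAX_MIB)
    # What remains after overhead and metaspace is the "total Flink memory".
    total_flink = process_mib - overhead - JVM_METASPACE_MIB
    network = clamp(total_flink * NETWORK_FRACTION,
                    NETWORK_MIN_MIB, NETWORK_MAX_MIB)
    managed = total_flink * MANAGED_FRACTION
    # Task heap is whatever is left once every other component is subtracted.
    task_heap = (total_flink - FRAMEWORK_HEAP_MIB - FRAMEWORK_OFF_HEAP_MIB
                 - TASK_OFF_HEAP_MIB - network - managed)
    return {
        "jvm_overhead": int(overhead),
        "metaspace": JVM_METASPACE_MIB,
        "network": int(network),
        "managed": int(managed),
        "task_heap": int(task_heap),
        # -Xmx / -Xms: framework heap plus task heap.
        "jvm_heap": FRAMEWORK_HEAP_MIB + int(task_heap),
        # -XX:MaxDirectMemorySize: framework off-heap, task off-heap, network.
        "max_direct": FRAMEWORK_OFF_HEAP_MIB + TASK_OFF_HEAP_MIB + int(network),
    }


if __name__ == "__main__":
    for name, mib in derive_memory(1568).items():
        print(f"{name}: {mib} MiB ({mib * MIB} bytes)")
```

Under these assumptions, 10% of 1568 MiB is 156.8 MiB, which is below the 192 MiB minimum, so the minimum is used, matching the warning in the log. The byte values in the log differ from exact MiB multiples by a few bytes, presumably from rounding in the real implementation. Note that metaspace is a fixed size in this scheme rather than a fraction, so it does not grow with the process size.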