[ https://issues.apache.org/jira/browse/FLINK-35489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17850694#comment-17850694 ]
Nicolas Fraison edited comment on FLINK-35489 at 5/30/24 12:07 PM:
-------------------------------------------------------------------

Thanks [~fanrui] for the feedback, it helped me realise that my analysis was wrong. The issue we are facing is the JVM crashing after the [autotuning|https://nightlies.apache.org/flink/flink-kubernetes-operator-docs-main/docs/custom-resource/autotuning/] changed some memory config:
{code:java}
Starting kubernetes-taskmanager as a console application on host flink-kafka-job-apache-right-taskmanager-1-1.
Exception in thread "main" *** java.lang.instrument ASSERTION FAILED ***: "result" with message agent load/premain call failed at src/java.instrument/share/native/libinstrument/JPLISAgent.c line: 422
FATAL ERROR in native method: processing of -javaagent failed, processJavaStart failed
Native frames: (J=compiled Java code, A=aot compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x78dee4]  jni_FatalError+0x70
V  [libjvm.so+0x88df00]  JvmtiExport::post_vm_initialized()+0x240
V  [libjvm.so+0xc353fc]  Threads::create_vm(JavaVMInitArgs*, bool*)+0x7ac
V  [libjvm.so+0x79c05c]  JNI_CreateJavaVM+0x7c
C  [libjli.so+0x3b2c]  JavaMain+0x7c
C  [libjli.so+0x7fdc]  ThreadJavaMain+0xc
C  [libpthread.so.0+0x7624]  start_thread+0x184
{code}
Seeing the big increase of heap (from 1.5 GB to more than 3 GB) and the fact that memory.managed.size was shrunk to 0b made me think it was linked to missing off-heap memory. But you are right that jvm-overhead already reserves some memory for off-heap usage (and we indeed have around 400 MB with that config).

So looking back at the new config, I've identified the issue: the jvm-metaspace was shrunk to 22MB while it was previously set to 256MB. I've done a test increasing this parameter and the TM is now able to start.

For the metaspace size computation, I can see the autotuning computing METASPACE_MEMORY_USED=1.41521584E8, which seems to be an appropriate metaspace sizing. But due to the memBudget management it ends up assigning only 22MB to the metaspace ([it first allocates the remaining memory to the heap, then the new remainder to metaspace, and finally to managed memory|https://github.com/apache/flink-kubernetes-operator/blob/main/flink-autoscaler/src/main/java/org/apache/flink/autoscaler/tuning/MemoryTuning.java#L130]), as sketched below.
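To make that allocation order concrete, here is a minimal runnable sketch of a heap-first memory budget, using rough numbers from this ticket. The class, method, and variable names are mine for illustration, not the operator's actual MemoryTuning code:
{code:java}
/** Simplified sketch of the allocation order described above; names and
 *  structure are illustrative, not the operator's actual code. */
final class MemoryBudgetSketch {
    private long remaining; // unallocated bytes of the TM process budget

    MemoryBudgetSketch(long totalBytes) { this.remaining = totalBytes; }

    /** Grants min(requested, remaining); the order of the calls decides who wins. */
    long budget(long requestedBytes) {
        long granted = Math.min(requestedBytes, remaining);
        remaining -= granted;
        return granted;
    }

    public static void main(String[] args) {
        // ~3.72 GB is what is left of the 4 GiB pod once jvm-overhead,
        // network and framework off-heap are taken out (see config below).
        MemoryBudgetSketch budget = new MemoryBudgetSketch(3_722_894_624L);
        long heap      = budget.budget(3_699_934_605L); // heap ask is served first...
        long metaspace = budget.budget(141_521_584L);   // ...so metaspace only gets ~22 MB
        long managed   = budget.budget(Long.MAX_VALUE); // ...and managed memory gets 0b
        System.out.printf("heap=%d metaspace=%d managed=%d%n", heap, metaspace, managed);
    }
}
{code}
Because the heap ask is served first and almost exhausts the budget, the measured ~141.5 MB metaspace need is capped at the ~22 MB leftover, and managed memory gets nothing.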
> Metaspace size can be too little after autotuning change memory setting
> -----------------------------------------------------------------------
>
>                 Key: FLINK-35489
>                 URL: https://issues.apache.org/jira/browse/FLINK-35489
>             Project: Flink
>          Issue Type: Improvement
>          Components: Kubernetes Operator
>    Affects Versions: 1.8.0
>            Reporter: Nicolas Fraison
>            Priority: Major
>
> We have enabled the autotuning feature on one of our Flink jobs with the below config:
> {code:java}
> # Autoscaler configuration
> job.autoscaler.enabled: "true"
> job.autoscaler.stabilization.interval: 1m
> job.autoscaler.metrics.window: 10m
> job.autoscaler.target.utilization: "0.8"
> job.autoscaler.target.utilization.boundary: "0.1"
> job.autoscaler.restart.time: 2m
> job.autoscaler.catch-up.duration: 10m
> job.autoscaler.memory.tuning.enabled: true
> job.autoscaler.memory.tuning.overhead: 0.5
> job.autoscaler.memory.tuning.maximize-managed-memory: true{code}
> During a scale down, the autotuning decided to give all the memory to the JVM (scaling the heap by 2), setting taskmanager.memory.managed.size to 0b.
> Here is the config that was computed by the autotuning for a TM running on a 4GB pod:
> {code:java}
> taskmanager.memory.network.max: 4063232b
> taskmanager.memory.network.min: 4063232b
> taskmanager.memory.jvm-overhead.max: 433791712b
> taskmanager.memory.task.heap.size: 3699934605b
> taskmanager.memory.framework.off-heap.size: 134217728b
> taskmanager.memory.jvm-metaspace.size: 22960020b
> taskmanager.memory.framework.heap.size: "0 bytes"
> taskmanager.memory.flink.size: 3838215565b
> taskmanager.memory.managed.size: 0b {code}
> This has led to some issues starting the TM because we are relying on a javaagent performing some memory allocation outside of the JVM (it relies on some C bindings).
> Tuning the overhead or disabling scale-down-compensation.enabled could have helped for that particular event, but this can lead to other issues as it could result in too small a heap size being computed.
> It would be interesting to be able to set a minimum memory.managed.size to be taken into account by the autotuning.
> What do you think about this? Do you think that some other specific config should have been applied to avoid this issue?
>
> Edit: see this comment that leads to the metaspace issue: https://issues.apache.org/jira/browse/FLINK-35489?focusedCommentId=17850694&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17850694
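As a rough cross-check of the config above, the components do account for the full pod, leaving metaspace with only the remainder. This is a back-of-the-envelope sketch assuming the TM process gets exactly 4 GiB; the class and variable names are mine:
{code:java}
public class MetaspaceLeftoverCheck {
    public static void main(String[] args) {
        long total    = 4_294_967_296L;  // assumed 4 GiB TM process size
        long overhead = 433_791_712L;    // taskmanager.memory.jvm-overhead.max
        long network  = 4_063_232L;      // taskmanager.memory.network.max
        long offHeap  = 134_217_728L;    // taskmanager.memory.framework.off-heap.size
        long heap     = 3_699_934_605L;  // taskmanager.memory.task.heap.size
        long leftover = total - overhead - network - offHeap - heap;
        // Prints 22960019 -- within a byte of the 22960020b jvm-metaspace.size,
        // which in turn leaves 0b for managed memory.
        System.out.println("leftover for metaspace + managed: " + leftover);
    }
}
{code}
So the 22MB metaspace is simply whatever was left after the heap took its share, not the ~141.5 MB the autotuning itself measured as being in use.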