Yu Chen created FLINK-33613:
-------------------------------

Summary: Python UDF Runner process leak in Process Mode
Key: FLINK-33613
URL: https://issues.apache.org/jira/browse/FLINK-33613
Project: Flink
Issue Type: Bug
Components: API / Python
Affects Versions: 1.17.0
Reporter: Yu Chen
Attachments: ps-ef.txt, streaming_word_count-1.py

While working with PyFlink, we found that in Process Mode the Python UDF process may leak after a job failover. This leads to an ever-growing number of processes (and their threads) on the host machine, which eventually results in a failure to create new threads.

You can reproduce it with the attached test job `streaming_word_count-1.py`. (Note that the job keeps failing over; you can watch the leaked processes with `ps -ef` on the TaskManager.)

Our test environment:
* K8S Application Mode
* 4 TaskManagers with 12 slots per TaskManager
* Job parallelism set to 48

The number of UDF processes (`pyflink.fn_execution.beam.beam_boot`) should be consistent with the parallelism (48), but we found 180 such processes after several failovers.
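
To track the leak without eyeballing `ps -ef` after every failover, the count can be scripted. The following is a minimal sketch, assuming `ps` is available inside the TaskManager container; `read_ps`, `count_udf_workers` and `check_for_leak` are hypothetical helpers, not part of PyFlink. It only counts `ps -ef` rows whose command line contains `pyflink.fn_execution.beam.beam_boot`.

```python
import subprocess

# Module name of the PyFlink UDF worker process, as reported in this issue.
BEAM_BOOT = "pyflink.fn_execution.beam.beam_boot"


def read_ps() -> str:
    """Return the raw output of `ps -ef` (assumes `ps` exists on the host/pod)."""
    return subprocess.run(
        ["ps", "-ef"], capture_output=True, text=True, check=True
    ).stdout


def count_udf_workers(ps_output: str, pattern: str = BEAM_BOOT) -> int:
    """Count `ps -ef` rows whose command line mentions the UDF worker module."""
    return sum(1 for line in ps_output.splitlines() if pattern in line)


def check_for_leak(ps_output: str, expected: int) -> bool:
    """Return True if more UDF worker processes exist than expected."""
    return count_udf_workers(ps_output) > expected
```

With the setup above (parallelism 48 over 4 TaskManagers) and assuming subtasks are spread evenly, one would expect roughly 12 such processes per TaskManager, so `check_for_leak(read_ps(), expected=12)` returning `True` after a few failovers would point to the leak.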