Hi,

After running for a while, my job manager holds thousands of CLOSE_WAIT TCP connections to an HDFS datanode. The number grows slowly and will likely hit the max open file limit. My jobs checkpoint to HDFS every minute.

If I run lsof -i -a -p $JMPID, I get tons of output like the following:

java 9433 iot 408u IPv4 4060901898 0t0 TCP jmHost:17922->datanode:50010 (CLOSE_WAIT)
java 9433 iot 409u IPv4 4061478455 0t0 TCP jmHost:52854->datanode:50010 (CLOSE_WAIT)
java 9433 iot 410r IPv4 4063170767 0t0 TCP jmHost:49384->datanode:50010 (CLOSE_WAIT)
java 9433 iot 411w IPv4 4063188376 0t0 TCP jmHost:50516->datanode:50010 (CLOSE_WAIT)
java 9433 iot 412u IPv4 4061459881 0t0 TCP jmHost:51651->datanode:50010 (CLOSE_WAIT)
java 9433 iot 413u IPv4 4063737603 0t0 TCP jmHost:31318->datanode:50010 (CLOSE_WAIT)
java 9433 iot 414w IPv4 4062030625 0t0 TCP jmHost:34033->datanode:50010 (CLOSE_WAIT)
java 9433 iot 415u IPv4 4062049134 0t0 TCP jmHost:35156->datanode:50010 (CLOSE_WAIT)
java 9433 iot 416u IPv4 4062615550 0t0 TCP jmHost:16962->datanode:50010 (CLOSE_WAIT)
java 9433 iot 417r IPv4 4063757056 0t0 TCP jmHost:32553->datanode:50010 (CLOSE_WAIT)
java 9433 iot 418w IPv4 4064304789 0t0 TCP jmHost:13375->datanode:50010 (CLOSE_WAIT)
java 9433 iot 419u IPv4 4062599328 0t0 TCP jmHost:15915->datanode:50010 (CLOSE_WAIT)
java 9433 iot 420w IPv4 4065462963 0t0 TCP jmHost:30432->datanode:50010 (CLOSE_WAIT)
java 9433 iot 421u IPv4 4067178257 0t0 TCP jmHost:28334->datanode:50010 (CLOSE_WAIT)
java 9433 iot 422u IPv4 4066022066 0t0 TCP jmHost:11843->datanode:50010 (CLOSE_WAIT)
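In case it helps, this is a quick sketch of how I track the leak (assuming $JMPID holds the job manager PID as above; 50010 is the datanode data-transfer port seen in the output):

  # total CLOSE_WAIT sockets held by the job manager
  lsof -i -a -p $JMPID | grep -c CLOSE_WAIT

  # count per remote datanode endpoint
  lsof -i -a -p $JMPID | awk '/CLOSE_WAIT/ {split($9, a, "->"); print a[2]}' | sort | uniq -c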
I know restarting the job manager should clean up those connections, but I wonder if there is a better solution? Btw, I am using Flink 1.4.0 and running a standalone cluster.

Thanks,
Youjun