[ https://issues.apache.org/jira/browse/FLINK-38344?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
RocMarshal updated FLINK-38344:
-------------------------------
    Description: 
When the {{historyserver.web.tmpdir}} configuration points to a non-system temporary directory, the contents of this directory are only cleaned up if explicitly deleted. Under the current cleanup logic, this directory is cleared in the following two scenarios:

1. *When the HistoryServer encounters an exception*, it actively cleans up this directory. However, if the HistoryServer process is forcibly terminated externally, this cleanup logic is not triggered.

!image-2025-09-11-00-31-26-595.png!

!image-2025-09-11-00-32-25-793.png!

2. *The {{HistoryServerArchiveFetcher}}* builds {{cachedArchivesPerRefreshDirectory}} based on the job information still present in the remote directory and uses it to determine which local job files need cleanup. Consequently, if the HistoryServer retains a large number of local job files that no longer exist in remote storage, these files will never be deleted. This may lead to excessive file handle usage on the local node, resulting in file descriptor leaks.

!image-2025-09-11-00-34-54-580.png!

A relatively straightforward fix would be: in the HistoryServer constructor, first clear all files in the {{historyserver.web.tmpdir}} directory before proceeding with the original initialization logic. This ensures that every local file present afterwards is tracked by {{HistoryServerArchiveFetcher#cachedArchivesPerRefreshDirectory}} and is therefore eligible for cleanup, so no stale files are leaked.

I'd like to fix it.

> The local files of the HistoryServer may risk never being deleted.
> ------------------------------------------------------------------
>
>                 Key: FLINK-38344
>                 URL: https://issues.apache.org/jira/browse/FLINK-38344
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Web Frontend
>            Reporter: RocMarshal
>            Priority: Minor
>         Attachments: image-2025-09-11-00-31-26-595.png, image-2025-09-11-00-32-25-793.png, image-2025-09-11-00-34-54-580.png

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
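The fix proposed in the description (clear everything under {{historyserver.web.tmpdir}} once at startup, before the normal initialization runs) could look roughly like the sketch below. This is only an illustration under stated assumptions: {{TmpDirCleaner}} and {{clearDirectory}} are hypothetical names that do not exist in Flink, the directory in {{main}} is a throwaway temp directory standing in for the configured one, and a real patch would more likely reuse Flink's existing file utilities and be called from the HistoryServer constructor.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

/** Hypothetical helper for illustration; not part of Flink. */
public class TmpDirCleaner {

    /** Deletes everything below {@code dir} but keeps {@code dir} itself. */
    public static void clearDirectory(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return; // nothing to clear, e.g. on the very first start
        }
        List<Path> entries;
        try (Stream<Path> walk = Files.walk(dir)) {
            // Reverse order lists children before their parent directories,
            // so each directory is already empty when it is deleted.
            entries = walk.filter(p -> !p.equals(dir))
                          .sorted(Comparator.reverseOrder())
                          .collect(Collectors.toList());
        }
        for (Path entry : entries) {
            Files.deleteIfExists(entry);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the directory configured via historyserver.web.tmpdir.
        Path tmpDir = Files.createTempDirectory("historyserver-web-tmpdir");

        // Simulate a stale local job file left over from a killed process.
        Path staleJob = tmpDir.resolve("jobs").resolve("stale-job");
        Files.createDirectories(staleJob);
        Files.write(staleJob.resolve("config.json"), "{}".getBytes());

        // Assumed call site: before the original initialization logic.
        clearDirectory(tmpDir);

        try (Stream<Path> remaining = Files.list(tmpDir)) {
            System.out.println("entries left: " + remaining.count());
        }
    }
}
```

Clearing the directory at startup trades a re-download of the archives that still exist remotely for the guarantee that no untracked local files survive a forced termination.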