dongwoo.kim created FLINK-33715:
-----------------------------------

             Summary: Enhance history server to archive multiple histories per jobid
                 Key: FLINK-33715
                 URL: https://issues.apache.org/jira/browse/FLINK-33715
             Project: Flink
          Issue Type: Improvement
          Components: Runtime / Coordination
            Reporter: dongwoo.kim

Hello Flink team,

I'd like to propose an improvement to how the job manager archives job histories and how the Flink history server fetches them.

Currently, only one history per jobid can be archived and fetched. When a Flink job tries to archive its history more than once, a 'FileAlreadyExistsException' usually occurs. This makes sense in most cases, since a job typically gets a new ID when it is restarted from the latest checkpoint/savepoint.

However, there is a specific situation where this behavior can be problematic:
1) When we upgrade a job using the savepoint mode, the job's first history is archived successfully.
2) If the same job later fails due to an error, its history isn't archived again because a record with the same job ID already exists.

This can be an issue because the most valuable information, why the job failed, gets lost.

As a simple solution, I suggest including currentTimeMillis in the history filename along with the jobid ( {jobid}-{currentTimeMillis} ). On the fetching side, the history server would parse the jobid before the "-" delimiter and fetch all the histories for that jobid.
For the UI, we can keep the current display, or enhance it by adding an extra level of hierarchy per jobid, since each jobid can now have multiple histories.

If we can reach an agreement, I'll be glad to take on the implementation.

Thanks in advance.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
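
For illustration, the naming scheme proposed above ( {jobid}-{currentTimeMillis} ) and the matching parse step on the fetching side could look roughly like the sketch below. This is not existing Flink code: the class ArchiveNames and its methods are hypothetical names, and the sketch assumes that a jobid never contains the "-" character itself (Flink job IDs are normally printed as plain hex strings). File names without a delimiter are treated as archives written under the current {jobid} scheme.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical helper for the proposed archive naming; not part of Flink. */
final class ArchiveNames {
    // Assumption: the jobid itself never contains this character.
    private static final char DELIMITER = '-';

    private ArchiveNames() {}

    /** Builds the proposed file name "{jobid}-{currentTimeMillis}". */
    static String archiveName(String jobId, long timestampMillis) {
        return jobId + DELIMITER + timestampMillis;
    }

    /**
     * Returns the jobid part of an archive file name. Names without the
     * delimiter are treated as old-style archives named "{jobid}".
     */
    static String jobIdOf(String fileName) {
        int idx = fileName.indexOf(DELIMITER);
        return idx < 0 ? fileName : fileName.substring(0, idx);
    }

    /** Groups archive file names by jobid, keeping the input order within each group. */
    static Map<String, List<String>> groupByJobId(List<String> fileNames) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String name : fileNames) {
            grouped.computeIfAbsent(jobIdOf(name), k -> new ArrayList<>()).add(name);
        }
        return grouped;
    }
}
```

On the archiving side, a call such as ArchiveNames.archiveName(jobId, System.currentTimeMillis()) would give each archive attempt its own file name, so a second attempt for the same jobid no longer collides with the first. On the fetching side, groupByJobId would give the history server every archive that belongs to one jobid.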