[ https://issues.apache.org/jira/browse/FLINK-13633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901981#comment-16901981 ]
TisonKun commented on FLINK-13633:
----------------------------------

Thanks for filing this issue [~fly_in_gis]. I think it is reasonable to put persisted job graphs and checkpoints under the cluster-id subdirectory of the HA storage.

To be clear, the current layout is

{code:java}
└── <high-availability.storageDir>
    ├── submittedJobGraph
    │   ├── jobgraph1
    │   └── jobgraph2
    ├── completedCheckpoint
    │   ├── checkpoint1
    │   ├── checkpoint2
    │   └── checkpoint3
    └── <high-availability.cluster-id>
        ├── blob1
        └── blob2
{code}

and the proposed layout is

{code:java}
└── <high-availability.storageDir>
    └── <high-availability.cluster-id>
        ├── submittedJobGraph
        │   ├── jobgraph1
        │   └── jobgraph2
        ├── completedCheckpoint
        │   ├── checkpoint1
        │   ├── checkpoint2
        │   └── checkpoint3
        ├── blob1
        └── blob2
{code}

This does help with cleaning up the storage per cluster on shutdown.

Here are some questions:

1. Since we retrieve job graphs and checkpoints through state handles, are the subdirectories "submittedJobGraph" and "completedCheckpoint" necessary or helpful (maybe for human readability)? What about a flat storage layout (i.e., all files directly under {{<storageDir>/<cluster-id>/}})?

2. From another angle, what if we move all blobs under {{<storageDir>/<cluster-id>/blobs/}}?

> Move submittedJobGraph and completedCheckpoint to cluster-id subdirectory of
> high-availability storage
> -------------------------------------------------------------------------------------------------------
>
>                 Key: FLINK-13633
>                 URL: https://issues.apache.org/jira/browse/FLINK-13633
>             Project: Flink
>          Issue Type: New Feature
>            Reporter: Yang Wang
>            Priority: Major
>
> Currently, if high availability is enabled, the HA storage directory is
> structured as shown below. The submittedJobGraph and completedCheckpoint
> files are stored directly under the HA storage path. This is fine when the
> Flink cluster finishes normally. However, when the YARN application fails
> or is killed, the submittedJobGraph and completedCheckpoint files remain
> there forever.
> We cannot even tell which Flink cluster (YARN application) they belong
> to. So I suggest moving them into the application subdirectory. External
> tools could then be used to clean up these residual files.
> Also, we need to do a best-effort clean-up before the Flink cluster finishes.
>
> Current HA storage directory structure
> {code:java}
> └── /tmp/flink/ha
>     ├── submittedJobGraphxxxx
>     ├── completedCheckpointxxxx
>     ├── application_xxxx_xxxx
>     │   ├── blob
> {code}
>
> The new HA storage directory structure
> {code:java}
> └── /tmp/flink/ha
>     ├── application_xxxx_xxxx
>     │   ├── blob
>     │   ├── submittedJobGraphxxxx
>     │   ├── completedCheckpointxxxx
> {code}
>
>

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
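
To illustrate why the per-cluster layout discussed in this thread simplifies clean-up, here is a minimal Python sketch. It is not Flink code: the helpers `cluster_dir`, `write_ha_file`, and `clean_up_cluster` are hypothetical names, and a local directory stands in for the real HA storage (which would typically be a distributed file system). The only point is that once every file lives under `<storageDir>/<cluster-id>/`, an external tool can remove one cluster's residue by deleting a single directory.

```python
import shutil
import tempfile
from pathlib import Path


def cluster_dir(storage_dir: Path, cluster_id: str) -> Path:
    """Return <storageDir>/<cluster-id>, the single root for one cluster's HA data."""
    return storage_dir / cluster_id


def write_ha_file(storage_dir: Path, cluster_id: str, kind: str, name: str, data: bytes) -> Path:
    """Store a file under <storageDir>/<cluster-id>/<kind>/<name> (assumed layout)."""
    target = cluster_dir(storage_dir, cluster_id) / kind / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_bytes(data)
    return target


def clean_up_cluster(storage_dir: Path, cluster_id: str) -> None:
    """Best-effort removal of everything that belongs to one cluster."""
    shutil.rmtree(cluster_dir(storage_dir, cluster_id), ignore_errors=True)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ha = Path(tmp)
        write_ha_file(ha, "application_1", "submittedJobGraph", "jobgraph1", b"graph")
        write_ha_file(ha, "application_1", "completedCheckpoint", "checkpoint1", b"chk")
        write_ha_file(ha, "application_2", "blobs", "blob1", b"blob")
        # Removing one cluster's directory leaves other clusters untouched.
        clean_up_cluster(ha, "application_1")
        print(sorted(p.name for p in ha.iterdir()))
```

With the current layout, the same clean-up would need to know which job graph and checkpoint files belong to which cluster, which is exactly the information the issue says is missing.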