[ https://issues.apache.org/jira/browse/FLINK-6020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15999405#comment-15999405 ]
ASF GitHub Bot commented on FLINK-6020:
---------------------------------------

Github user StephanEwen commented on the issue:

    https://github.com/apache/flink/pull/3525

    @WangTaoTheTonic I have debugged this issue a bit further, and it seems there is a bit more to do.

    For non-HA blob servers, the atomic rename fix would do it. For HA cases, we need to do a bit more. A recent change was that the blob cache will try to fetch blobs directly from the blob store, which may cause premature reads before the blob has been fully written. Because the storage systems we target for HA do not all support atomic renames (S3 does not), we need to use the `_SUCCESS` file trick to mark completed blobs.

    I chatted with @tillrohrmann about that; he agreed to take a look at fixing these and will make an effort to get this into the 1.3 release. Hope that this will work for you.


> Blob Server cannot handle multiple job submits (with same content) parallelly
> ------------------------------------------------------------------------------
>
>                 Key: FLINK-6020
>                 URL: https://issues.apache.org/jira/browse/FLINK-6020
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Distributed Coordination
>            Reporter: Tao Wang
>            Assignee: Tao Wang
>            Priority: Critical
>
> In yarn-cluster mode, if we submit the same job multiple times in parallel, the tasks will encounter class loading problems and lease occupation.
> The blob server stores user jars under file names derived from the sha1sum of their contents: it first writes a temp file and then moves it to the final name. For recovery it also puts the jars onto HDFS under the same file names.
> When multiple clients submit the same job with the same jar at the same time, the local jar files in the blob server and the files on HDFS are handled by multiple threads (BlobServerConnection) and interfere with each other.
> It would be better to have a way to handle this; two ideas come to mind:
> 1. lock the write operation, or
> 2. use a unique identifier as the file name instead of (or in addition to) the sha1sum of the file contents.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
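A minimal sketch of the two techniques discussed in the comment above: the temp-file-plus-atomic-rename fix for the local blob server, and the `_SUCCESS` marker trick for HA blob stores without atomic rename (such as S3). The class and method names are hypothetical, this is not Flink's actual BlobServer or blob store code, and local `java.nio.file` calls stand in for the HA store's file system API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical helper class, for illustration only.
public class BlobWriteSketch {

    // Non-HA fix: write to a temp file in the storage directory, then rename it
    // onto the final name. On POSIX file systems the rename is atomic and replaces
    // an existing target, so concurrent writers of the same sha1 key do not
    // interfere and readers never observe a half-written file.
    static void writeLocalBlob(InputStream data, Path storageDir, String sha1Key) throws IOException {
        Path tmp = Files.createTempFile(storageDir, "blob-", ".tmp");
        Files.copy(data, tmp, StandardCopyOption.REPLACE_EXISTING);
        Files.move(tmp, storageDir.resolve(sha1Key), StandardCopyOption.ATOMIC_MOVE);
    }

    // HA fix for stores without atomic rename (e.g. S3): write the blob first,
    // then publish a _SUCCESS marker only once the blob is fully written.
    static void writeHaBlob(InputStream data, Path haStorageDir, String sha1Key) throws IOException {
        Files.copy(data, haStorageDir.resolve(sha1Key), StandardCopyOption.REPLACE_EXISTING);
        try {
            Files.createFile(haStorageDir.resolve(sha1Key + "_SUCCESS"));
        } catch (FileAlreadyExistsException ignored) {
            // Another client submitting the same jar finished first; that is fine,
            // since the key is the sha1sum, so the content is identical.
        }
    }

    // Readers (e.g. a blob cache fetching directly from the store) should check
    // for the marker before trusting the blob, which avoids the premature reads
    // described in the comment above.
    static boolean isBlobComplete(Path haStorageDir, String sha1Key) {
        return Files.exists(haStorageDir.resolve(sha1Key + "_SUCCESS"));
    }
}
```

The marker approach trades one extra write per blob for a clear completion signal: a reader that only sees the blob but not its `_SUCCESS` file treats the blob as still in flight and retries or falls back to fetching from the blob server.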