Blob Server Removes Failed Jobs Immediately

Dominik Wosiński Wed, 20 Jun 2018 07:43:24 -0700

Hello,

I'm not sure whether the problem is caused by a bad configuration or by an
inconsistency in the documentation, but according to this document:

https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture

*If a job fails, all non-HA files' refCounts are reset to 0; all HA files'
refCounts remain and will not be increased again on recovery.*

But in the JobManager's code, if the job status changes to failed and the
JobManager receives the message reporting that, it sends a *RemoveJob*
message to itself. This invokes the *removeJob()* function, which always
calls the following functions:

libraryCacheManager.unregisterJob(jobID)
blobServer.cleanupJob(jobID, removeJobFromStateBackend)

jobManagerMetricGroup.removeJob(jobID)

As far as I understand, this removes the blob entries immediately. According
to the doc, however, it should only freeze the refCounts for HA files and
reset the refCounts for non-HA files to 0 to allow their later removal.
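
To make the difference concrete, here is a minimal sketch of how I read the
two behaviors. It is a toy model, not Flink code: the class
BlobRefCountSketch and all of its methods are hypothetical names made up for
illustration, and the real BlobServer and library cache are not modeled.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model only. None of these names exist in Flink. */
final class BlobRefCountSketch {

    /** One stored blob: whether it is an HA file and how often it is referenced. */
    static final class Entry {
        final boolean highAvailability;
        int refCount;

        Entry(boolean highAvailability, int refCount) {
            this.highAvailability = highAvailability;
            this.refCount = refCount;
        }
    }

    private final Map<String, Entry> blobs = new HashMap<>();

    public void put(String key, boolean highAvailability, int refCount) {
        blobs.put(key, new Entry(highAvailability, refCount));
    }

    /**
     * My reading of the FLIP-19 text: on failure, non-HA refCounts are reset
     * to 0 and HA refCounts stay as they are. Nothing is deleted at this point.
     */
    public void onJobFailedAsDocumented() {
        for (Entry entry : blobs.values()) {
            if (!entry.highAvailability) {
                entry.refCount = 0;
            }
        }
    }

    /** What removeJob() appears to do: every entry is dropped right away. */
    public void onJobFailedAsObserved() {
        blobs.clear();
    }

    public boolean contains(String key) {
        return blobs.containsKey(key);
    }

    public int refCount(String key) {
        return blobs.get(key).refCount;
    }

    public static void main(String[] args) {
        BlobRefCountSketch documented = new BlobRefCountSketch();
        documented.put("ha-blob", true, 2);
        documented.put("non-ha-blob", false, 3);
        documented.onJobFailedAsDocumented();
        System.out.println("documented, HA refCount: " + documented.refCount("ha-blob"));
        System.out.println("documented, non-HA refCount: " + documented.refCount("non-ha-blob"));

        BlobRefCountSketch observed = new BlobRefCountSketch();
        observed.put("ha-blob", true, 2);
        observed.put("non-ha-blob", false, 3);
        observed.onJobFailedAsObserved();
        System.out.println("observed, HA blob still present: " + observed.contains("ha-blob"));
    }
}
```

In the documented variant both entries survive the failure and only the
non-HA refCount changes, whereas in the observed variant nothing is left to
recover from.
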
Is the doc right, and have I missed something here?
Thanks in advance.
