Re: Blob Server Removes Failed Jobs Immediately

Till Rohrmann Wed, 20 Jun 2018 23:20:02 -0700

Hi Dominik,

all job-related files (non-HA as well as HA) are removed once the job
reaches a globally terminal state (FINISHED, CANCELLED, FAILED). Flink
assumes that a job in such a state is done and won't be retried
afterwards. The documentation in the FLIP is therefore inaccurate and
should be corrected.
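
To illustrate the behaviour described above, here is a minimal sketch. It is
not Flink's actual code: the class, enum and method names are hypothetical, and
the model deliberately ignores the HA/non-HA distinction, since both kinds of
files are treated the same once the job is globally terminal.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model; none of these names are Flink's actual API. */
final class BlobCleanupSketch {

    /** Simplified job states; only the three states named above are globally terminal. */
    enum SketchJobStatus {
        RUNNING(false),
        RESTARTING(false),
        FINISHED(true),
        CANCELLED(true),
        FAILED(true);

        private final boolean globallyTerminal;

        SketchJobStatus(boolean globallyTerminal) {
            this.globallyTerminal = globallyTerminal;
        }

        boolean isGloballyTerminal() {
            return globallyTerminal;
        }
    }

    // Blob keys per job; HA and non-HA files are deliberately not distinguished.
    private final Map<String, Set<String>> blobsByJob = new HashMap<>();

    void registerBlob(String jobId, String blobKey) {
        blobsByJob.computeIfAbsent(jobId, id -> new HashSet<>()).add(blobKey);
    }

    /** Removes every blob of the job as soon as it reaches a globally terminal state. */
    void onJobStatusChange(String jobId, SketchJobStatus newStatus) {
        if (newStatus.isGloballyTerminal()) {
            blobsByJob.remove(jobId);
        }
    }

    boolean hasBlobs(String jobId) {
        return blobsByJob.containsKey(jobId);
    }

    public static void main(String[] args) {
        BlobCleanupSketch store = new BlobCleanupSketch();
        store.registerBlob("job-1", "user-code.jar");

        // A restarting job may still be retried, so its files are kept.
        store.onJobStatusChange("job-1", SketchJobStatus.RESTARTING);
        System.out.println("after RESTARTING: " + store.hasBlobs("job-1"));

        // FAILED is globally terminal, so the files are removed right away.
        store.onJobStatusChange("job-1", SketchJobStatus.FAILED);
        System.out.println("after FAILED: " + store.hasBlobs("job-1"));
    }
}
```

In this sketch a job that is merely restarting keeps its files, while a job that
reaches FAILED loses them immediately, which matches what Dominik observed.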


Cheers,
Till

On Wed, Jun 20, 2018 at 7:11 PM Chesnay Schepler <ches...@apache.org> wrote:

> hmm, this indeed looks odd. Looping in Till (cc) who might know more about
> this.
>
> On 20.06.2018 16:43, Dominik Wosiński wrote:
>
> Hello,
>
> I'm not sure whether the problem is caused by a bad configuration or
> by an inconsistency in the documentation, but according to this document:
>
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-19%3A+Improved+BLOB+storage+architecture
>
> "If a job fails, all non-HA files' refCounts are reset to 0; all HA files'
> refCounts remain and will not be increased again on recovery."
>
> But in the JobManager's code, if the job status changes to FAILED and the
> JobManager receives the corresponding message, it sends a RemoveJob message
> to itself, which invokes the removeJob() function, which always invokes the
> following functions:
>
> libraryCacheManager.unregisterJob(jobID)
> blobServer.cleanupJob(jobID, removeJobFromStateBackend)
> jobManagerMetricGroup.removeJob(jobID)
>
> As far as I understand, this removes the blob entries immediately. According
> to the doc, however, it should only freeze the refCounts for HA files and
> reset the refCounts for non-HA files to allow their later removal.
> Is the doc right and have I missed something here?
> Thanks in advance.
>
>
>
