Hmm, this is not a very gentleman-like way to terminate the Job/TaskManagers. I'll check how the ActorSystem behaves when the process is killed.
Why can't we implement a more graceful termination mechanism? For example, we could send a termination message to the JobManager and TaskManagers.

On Thu, Feb 5, 2015 at 4:10 PM, Ufuk Celebi <u...@apache.org> wrote:

> Thank you very much, Robert!
>
> The problem is that the job/task manager shutdown methods are never
> called. When using the scripts, the task/job manager processes get killed
> and therefore shutdown methods are never called.
>
> @Till: Do you know whether there is a mechanism in Akka to register the
> actors for JVM shutdown hooks? I tried to register a shutdown hook via
> Runtime.getRuntime().addShutdownHook(), but I didn't manage to get a
> reference to the task manager.
>
> On Thu, Feb 5, 2015 at 3:29 PM, Till Rohrmann <trohrm...@apache.org>
> wrote:
>
>> Hi Robert,
>>
>> thanks for the info. If the TaskManager/JobManager does not shut down
>> properly, e.g. because the process is killed, then it is indeed the case
>> that the BlobManager cannot properly remove all stored files. I don't
>> know if this was lately the case for you. Furthermore, the files are not
>> directly deleted after the job has finished. Internally there is a
>> cleanup task which is triggered every hour and deletes all blobs which
>> are no longer referenced.
>>
>> But we definitely have to look into it to see how we could improve this
>> behaviour.
>>
>> Greets,
>>
>> Till
>>
>> On Thu, Feb 5, 2015 at 3:21 PM, Robert Waury
>> <robert.wa...@googlemail.com> wrote:
>>
>>> I talked with the admins. The problem seemed to have been that the disk
>>> was full and Flink couldn't create the directory.
>>>
>>> Maybe the error message should reflect if that is the cause.
>>>
>>> While cleaning up the disk we noticed that a lot of temporary blobStore
>>> files were not deleted by Flink after the job finished. This seemed to
>>> have caused or at least worsened the problem.
>>>
>>> Cheers,
>>> Robert
>>>
>>> On Thu, Feb 5, 2015 at 1:14 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>
>>>> On Thu, Feb 5, 2015 at 11:23 AM, Robert Waury
>>>> <robert.wa...@googlemail.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I can reproduce the error on my cluster.
>>>>>
>>>>> Unfortunately I can't check whether the parent directories were
>>>>> created on the different nodes since I have no way of accessing
>>>>> them. I start all the jobs from a gateway.
>>>>
>>>> I've added a check to the directory creation (in branches release-0.8
>>>> and master), which should fail with a proper error message if that is
>>>> the problem. If you have time to (re)deploy Flink, it would be great
>>>> to know if that indeed is the issue. Otherwise, we need to further
>>>> investigate this.
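
For reference, the shutdown-hook idea Ufuk mentions above could look roughly like the sketch below. It is only an illustration, not Flink's actual code: the class name, the helper methods, and the temporary directory are all made up for the example. It gets around the "no reference to the task manager" problem by capturing only the storage directory path in the hook. Note that JVM shutdown hooks run on a normal exit or on SIGTERM, but not when the process is killed with SIGKILL, so whether this helps depends on how the scripts stop the processes.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical sketch; none of these names come from Flink.
class BlobCleanupHookSketch {

    // Deletes a directory tree, visiting children before their parents.
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(root)) {
            // Reverse order puts deeper paths first, so directories are
            // already empty by the time they are deleted.
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    System.err.println("Could not delete " + p + ": " + e.getMessage());
                }
            });
        }
    }

    // Registers a JVM shutdown hook that removes the given directory.
    // The hook thread is returned so that a regular shutdown path can
    // deregister it via Runtime.removeShutdownHook once it has cleaned up.
    static Thread registerCleanupHook(Path storageDir) {
        Thread hook = new Thread(() -> {
            try {
                deleteRecursively(storageDir);
            } catch (IOException e) {
                System.err.println("Cleanup of " + storageDir + " failed: " + e.getMessage());
            }
        }, "blob-cleanup-hook");
        Runtime.getRuntime().addShutdownHook(hook);
        return hook;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the blob storage directory (assumed location).
        Path dir = Files.createTempDirectory("blobStore-sketch-");
        Files.write(dir.resolve("blob_1"), new byte[] {1, 2, 3});
        registerCleanupHook(dir);
        System.out.println("Storing blobs in " + dir);
        // When main returns, the JVM exits and the hook deletes the directory.
    }
}
```

A hook like this only complements a graceful termination message; it does not replace one, since it cannot help when the process is killed hard.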