Sounds good. In the course of this, we should probably extend the IOManager so that it keeps track of temp files and deletes them when a task is done (rough sketch below).
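To make the idea concrete, here is a minimal sketch of what I mean. The class and method names are made up for illustration and are not the actual IOManager API:

import java.io.File;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: track temp files per task and delete them
// when the task finishes. Not the actual IOManager API.
public class TaskTempFileTracker {

    // task ID -> temp files created on behalf of that task
    private final Map<String, Set<File>> filesPerTask =
            new HashMap<String, Set<File>>();

    public synchronized void registerTempFile(String taskId, File file) {
        Set<File> files = filesPerTask.get(taskId);
        if (files == null) {
            files = new HashSet<File>();
            filesPerTask.put(taskId, files);
        }
        files.add(file);
    }

    // to be called when the task finishes (or fails)
    public synchronized void disposeTask(String taskId) {
        Set<File> files = filesPerTask.remove(taskId);
        if (files == null) {
            return;
        }
        for (File f : files) {
            if (!f.delete()) {
                f.deleteOnExit(); // best-effort fallback
            }
        }
    }
}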
On Thu, Feb 5, 2015 at 4:40 PM, Ufuk Celebi <u...@apache.org> wrote:

> After talking to Robert and Till offline, what about the following:
>
> - We add a shutdown hook to the blob library cache manager to shut down
> the blob service (just a delete call)
> - As Robert pointed out, we cannot do this with the IOManager paths right
> now, because they are essentially shared among multiple Flink instances.
> Therefore we add an IOManager directory per Flink instance as well, which
> we can simply delete on shutdown.
>
> Is that OK?
>
> On Thu, Feb 5, 2015 at 4:23 PM, Stephan Ewen <se...@apache.org> wrote:
>
>> I think that process killing (HALT signal) is a very typical way in
>> Linux to shut down processes. It is the most robust way, since it does
>> not require sending any custom messages to the process.
>>
>> This is sort of graceful, as the JVM gets the signal and may do a lot
>> of things before shutting down, such as running shutdown hooks. The
>> ungraceful variant is the KILL signal, which just removes the process.
>>
>> On Thu, Feb 5, 2015 at 4:16 PM, Till Rohrmann <trohrm...@apache.org>
>> wrote:
>>
>>> Hmm, this is not a very gentleman-like way to terminate the
>>> Job/TaskManagers. I'll check how the ActorSystem behaves when the
>>> process is killed.
>>>
>>> Why can't we implement a more graceful termination mechanism? For
>>> example, we could send a termination message to the JobManager and
>>> TaskManagers.
>>>
>>> On Thu, Feb 5, 2015 at 4:10 PM, Ufuk Celebi <u...@apache.org> wrote:
>>>
>>>> Thank you very much, Robert!
>>>>
>>>> The problem is that the job/task manager shutdown methods are never
>>>> called. When using the scripts, the task/job manager processes get
>>>> killed, and therefore the shutdown methods are never called.
>>>>
>>>> @Till: Do you know whether there is a mechanism in Akka to register
>>>> the actors for JVM shutdown hooks? I tried to register a shutdown
>>>> hook via Runtime.getRuntime().addShutdownHook(), but I didn't manage
>>>> to get a reference to the task manager.
>>>>
>>>> On Thu, Feb 5, 2015 at 3:29 PM, Till Rohrmann <trohrm...@apache.org>
>>>> wrote:
>>>>
>>>>> Hi Robert,
>>>>>
>>>>> thanks for the info. If the TaskManager/JobManager does not shut
>>>>> down properly, i.e. the process is killed, then it is indeed the
>>>>> case that the BlobManager cannot properly remove all stored files.
>>>>> I don't know if this was lately the case for you. Furthermore, the
>>>>> files are not deleted directly after the job has finished.
>>>>> Internally, there is a cleanup task which is triggered every hour
>>>>> and deletes all blobs which are no longer referenced.
>>>>>
>>>>> But we definitely have to look into how we could improve this
>>>>> behaviour.
>>>>>
>>>>> Greets,
>>>>>
>>>>> Till
>>>>>
>>>>> On Thu, Feb 5, 2015 at 3:21 PM, Robert Waury <
>>>>> robert.wa...@googlemail.com> wrote:
>>>>>
>>>>>> I talked with the admins. The problem seemed to have been that the
>>>>>> disk was full and Flink couldn't create the directory.
>>>>>>
>>>>>> Maybe the error message should reflect it if that is the cause.
>>>>>>
>>>>>> While cleaning up the disk, we noticed that a lot of temporary
>>>>>> blobStore files were not deleted by Flink after the job finished.
>>>>>> This seemed to have caused, or at least worsened, the problem.
>>>>>>
>>>>>> Cheers,
>>>>>> Robert
>>>>>>
>>>>>> On Thu, Feb 5, 2015 at 1:14 PM, Ufuk Celebi <u...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> On Thu, Feb 5, 2015 at 11:23 AM, Robert Waury <
>>>>>>> robert.wa...@googlemail.com> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> I can reproduce the error on my cluster.
>>>>>>>>
>>>>>>>> Unfortunately, I can't check whether the parent directories were
>>>>>>>> created on the different nodes, since I have no way of accessing
>>>>>>>> them. I start all the jobs from a gateway.
>>>>>>>
>>>>>>> I've added a check to the directory creation (in branches
>>>>>>> release-0.8 and master), which should fail with a proper error
>>>>>>> message if that is the problem. If you have time to (re)deploy
>>>>>>> Flink, it would be great to know if that indeed is the issue.
>>>>>>> Otherwise, we need to investigate this further.
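Coming back to the shutdown hook idea from the quoted thread: a minimal sketch of registering such a hook via Runtime.getRuntime().addShutdownHook(), assuming a per-instance storage directory. The directory name and setup here are made up for illustration and are not the actual blob server code. As Stephan noted, a hook like this runs when the process receives a regular kill (TERM) signal, but not on kill -9:

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class InstanceDirCleanup {

    public static void main(String[] args) throws IOException {
        // Hypothetical per-instance directory, e.g. /tmp/blobStore-<random>
        final File instanceDir =
                Files.createTempDirectory("blobStore-").toFile();

        // Delete the whole directory when the JVM shuts down (TERM signal,
        // System.exit(), ...). Does NOT run on SIGKILL (kill -9).
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                deleteRecursively(instanceDir);
            }
        });

        // ... start the actual services here ...
    }

    private static void deleteRecursively(File file) {
        File[] children = file.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteRecursively(child);
            }
        }
        file.delete();
    }
}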