Thanks, that's helpful. I've also just discovered that Thermos only
monitors disk usage in the sandbox location, so if we launch a Docker job
and write anywhere other than /mnt/mesos/sandbox, we can exceed our disk
quota. I can work around this by turning our scratch space directories
into symlinks that point into the sandbox, though; something like the
sketch below.
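
As a minimal sketch (the paths, the helper name, and the fallback default
are illustrative; MESOS_SANDBOX is the environment variable Mesos sets
inside the container):

    import os
    import shutil

    # Move a scratch directory under the sandbox (the only tree Thermos
    # measures) and leave a symlink at the old path, so writes there
    # count against the disk quota.
    SANDBOX = os.environ.get("MESOS_SANDBOX", "/mnt/mesos/sandbox")

    def relocate_scratch(scratch_dir):
        if os.path.islink(scratch_dir):
            return  # already relocated
        target = os.path.join(SANDBOX, "scratch",
                              os.path.basename(scratch_dir.rstrip("/")))
        os.makedirs(target, exist_ok=True)
        if os.path.isdir(scratch_dir):
            # preserve anything already written, then swap in the symlink
            for entry in os.listdir(scratch_dir):
                shutil.move(os.path.join(scratch_dir, entry), target)
            os.rmdir(scratch_dir)
        os.symlink(target, scratch_dir)

    relocate_scratch("/local/scratch")  # hypothetical scratch path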

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard


On 8 April 2015 at 19:43, Zameer Manji <zma...@apache.org> wrote:

> Hey,
>
> The deletion of sandbox directories is done by the Mesos slave, not the GC
> executor. You will have to ask the Mesos devs about the relationship
> between low disk space and sandbox deletion.
>
> The executor enforces disk usage by periodically running `du` in the
> background. I suspect that in your case the process fails before the
> executor notices that the disk usage has been exceeded and marks the task
> as failed. This explains why the disk usage message is not there.
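>
> Roughly the idea, as a hand-wavy sketch (not the actual Thermos code; the
> poll interval and the failure mechanics here are made up):
>
>     import subprocess
>     import time
>
>     def watch_disk(sandbox, limit_bytes, interval_secs=30):
>         # Poll `du` and fail the task once usage exceeds its reservation.
>         # A process that fills the disk between polls dies with ENOSPC
>         # before this check ever fires, which matches what you saw.
>         while True:
>             out = subprocess.check_output(["du", "-sb", sandbox])
>             used = int(out.split()[0])
>             if used > limit_bytes:
>                 raise RuntimeError(
>                     "Disk limit exceeded. Reserved %d bytes vs used "
>                     "%d bytes." % (limit_bytes, used))
>             time.sleep(interval_secs)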
>
> I'm not sure why the finalizers are not running, but you should note that
> they are best effort on the executor's part. The executor won't be able to
> run them if, for example, Mesos tears down the container from underneath
> it.
>
> On Mon, Apr 6, 2015 at 10:30 AM, Hussein Elgridly <
> huss...@broadinstitute.org> wrote:
>
> > Hi folks,
> >
> > I've just had my first task fail due to exceeding disk capacity, and I've
> > run into some strange behaviour.
> >
> > It's a Java process that's running inside a Docker container specified in
> > the task config. The Java process is failing with java.io.IOException: No
> > space left on device when attempting to write a file.
> >
> > Three things are (or aren't) then happening which I think are just plain
> > wrong:
> >
> > 1. The task is being marked as failed (good!) but isn't reporting that it
> > exceeded disk limits (bad). I was expecting to see the "Disk limit
> > exceeded.  Reserved X bytes vs used Y bytes." message, but neither the
> > Mesos nor the Aurora web interface is telling me this.
> > 2. The task's sandbox directory is being nuked. All of it, immediately.
> > It was there while the job was running, then vanished as soon as it
> > failed (I happened to be watching it live). This makes debugging
> > difficult, and the Aurora/Thermos web UI clearly has trouble because it
> > reports the resource requests as all zero when they most definitely
> > weren't.
> > 3. Finalizers aren't running. No finalizers = no error log = no
> > debugging = sadface. :(
> >
> > I think what's actually happening here is that the process is running out
> > of disk on the machine itself and that IOException is propagating up from
> > the kernel, rather than Mesos killing the process as a result of its
> > disk usage monitoring.
> >
> > As such, we're going to try configuring the Mesos slaves with
> > --resources='disk:some_smaller_value' to leave a little overhead in the
> > hope that the Mesos disk monitor catches the overuse before the process
> > attempts to claim the last free block on disk.
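> >
> > For example (sizes made up; Mesos takes disk in MB, and I've left the
> > other slave flags out), a slave with ~60GB of usable disk could
> > advertise only 50GB:
> >
> >     mesos-slave --resources='cpus:8;mem:16384;disk:51200' ...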
> >
> > I don't know why it'd be nuking the sandbox, though. And is the GC
> > executor more aggressive about cleaning out old sandbox directories if
> > the disk is low on free space?
> >
> > If it helps, we're on Aurora commit
> > 2bf03dc5eae89b1e40bfd47683c54c185c78a9d3.
> >
> > Thanks,
> >
> > Hussein Elgridly
> > Senior Software Engineer, DSDE
> > The Broad Institute of MIT and Harvard
> >
> --
> Zameer Manji
>
