Hi folks, I've just had my first task fail due to exceeding disk capacity, and I've run into some strange behaviour.
It's a Java process that's running inside a Docker container specified in the task config. The Java process is failing with java.io.IOException: No space left on device when attempting to write a file.

Three things are (or aren't) then happening which I think are just plain wrong:

1. The task is being marked as failed (good!) but isn't reporting that it exceeded disk limits (bad). I was expecting to see the "Disk limit exceeded. Reserved X bytes vs used Y bytes." message, but neither the Mesos nor Aurora web interfaces are telling me this.

2. The task's sandbox directory is being nuked. All of it, immediately. There while the job is running, vanished as soon as it fails (I happened to be watching it live). This makes debugging difficult, and the Aurora/Thermos web UI clearly has trouble because it reports the resource requests as all zero when they most definitely weren't.

3. Finalizers aren't running. No finalizers = no error log = no debugging = sadface. :(

I think what's actually happening here is that the process is running out of disk on the machine itself and that IOException is propagating up from the kernel, rather than Mesos killing the process from its disk usage monitoring. As such, we're going to try configuring the Mesos slaves with --resources='disk:some_smaller_value' to leave a little overhead in the hope that the Mesos disk monitor catches the overusage before the process attempts to claim the last free block on disk.

I don't know why it'd be nuking the sandbox, though. And is the GC executor more aggressive about cleaning out old sandbox directories if the disk is low on free space?

If it helps, we're on Aurora commit 2bf03dc5eae89b1e40bfd47683c54c185c78a9d3.

Thanks,

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
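
P.S. To make the --resources idea concrete, here is a rough sketch of how the smaller disk value could be derived. It is only an illustration: the function name disk_resource_flag and the 5% headroom are made up for this example, and it assumes the Mesos disk resource is expressed in megabytes, which is worth checking against the Mesos version in use.

```python
import shutil

BYTES_PER_MB = 1024 * 1024


def disk_resource_flag(total_bytes, headroom_percent=5):
    """Build a --resources value that advertises less disk than physically exists.

    Hypothetical helper: holds back `headroom_percent` of the disk so that a
    disk monitor has a chance to act before the filesystem is truly full.
    Assumes the disk resource is given in megabytes.
    """
    if not 0 <= headroom_percent < 100:
        raise ValueError("headroom_percent must be in [0, 100)")
    # Integer arithmetic throughout to avoid floating-point rounding surprises.
    usable_bytes = total_bytes * (100 - headroom_percent) // 100
    usable_mb = usable_bytes // BYTES_PER_MB
    return "--resources='disk:{}'".format(usable_mb)


if __name__ == "__main__":
    # "/" is a placeholder; point this at the filesystem holding the sandboxes.
    total = shutil.disk_usage("/").total
    print(disk_resource_flag(total))
```

The resulting string would be passed to the slave alongside whatever other resources are already being declared; this sketch only covers the disk part.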