Most of the resolution to this happened on IRC (#aurora on Freenode), thanks to Steve :) I'm including it here "for logging"/searchability.
We're running all of mesos, aurora, thermos-observer, zookeeper and such in docker containers. Running mesos-slave in a container while it manages other containers on the same layer of "containerization" puts special requirements on the mesos-slave container. A simple schematic showing how things are layered:

host -> docker -> mesos-slave docker -> mesos-task docker -> ...

This requires that the directories mesos-slave uses are on the same path in both the slave container and on the host, since mesos-slave mounts those directories from docker's (the host's) perspective into the mesos-tasks.

The mesos-slave also seems to have its own bookkeeping that tracks the aurora-executor's pid. If mesos-slave can't find that pid, it assumes the task was lost. (This was the main issue I was banging my head against, since the failure isn't logged anywhere; the slave just prints "task_lost" and the executor just exits..) It happens when mesos-slave runs in a docker container, because the container gets an isolated pid namespace and therefore can't see the executor's pid. By running the docker container holding mesos-slave with "--pid=host", mesos-slave shares the host's pid namespace, can therefore find the aurora-executor's pid, and things magically work.. :>
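A quick way to see what --pid=host changes (a minimal sketch; assumes any stock image that ships ps, e.g. ubuntu:trusty):

# Without --pid=host the container gets an isolated pid namespace:
# ps only shows the container's own processes, so a containerized
# mesos-slave can never find an executor running outside it.
docker run --rm ubuntu:trusty ps ax

# With --pid=host the container shares the host's pid namespace and
# sees every process on the machine, including executors launched
# in sibling task containers.
docker run --rm --pid=host ubuntu:trusty ps ax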
For this setup (mesos-slave in a docker container, tasks in other containers, and so on), mesos-slave, thermos_observer and all slave/mesos tasks need to be able to access the same directories:

- /sys needs to be mounted into the mesos-slave container's /sys so it can manage cgroups.
- /var/run/thermos needs to be mounted into thermos_observer's /var/run/thermos. The same path is also mounted into every slave-task's /var/run/thermos, I'm assuming so that all slave-tasks can report back to thermos_observer whether or not they're running..
- /tmp/mesos (or whatever directory mesos-slave is configured to keep task sandboxes in) needs to be mounted from the host's /tmp/mesos into the mesos-slave's /tmp/mesos; mesos-slave then mounts it from the host into each mesos-task container.

Mesos-slave docker config, /etc/init/mesos-slave.conf:

#
# This properly starts and stops a docker container through upstart.
#
description "run mesos-slave"
respawn

script
  docker pull docker-repo.service.consul:5000/trusty/mesos
  . /etc/default/cluster
  mkdir -p /tmp/mesos
  docker create -i -t --name=consul_$UPSTART_JOB --net=host \
    -v /run/thermos:/run/thermos \
    -v /tmp/mesos:/tmp/mesos \
    -v /sys:/sys \
    -v /usr/bin/docker:/usr/bin/docker \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -e ULIMIT="-n 8192" \
    -e MESOS_resources="ports:[13000-15999]" \
    -e MESOS_FRAMEWORKS_HOME=/data/services/mesos/frameworks \
    -e MESOS_launcher_dir=/usr/libexec/mesos \
    -e MESOS_attributes="role:slave;rack:$RACK;host:`hostname`" \
    --pid=host \
    docker-repo.service.consul:5000/trusty/mesos \
    /usr/sbin/mesos-slave --master=zk://zookeeper.service.consul:2181/mesos/$CLUSTER \
      --containerizers=docker,mesos --log_dir=/var/lib/mesos/log \
      --executor_registration_timeout=5mins
  exec docker start -a consul_$UPSTART_JOB
end script

post-stop script
  docker stop -t 0 consul_$UPSTART_JOB
  docker rm consul_$UPSTART_JOB
  sleep 5
end script

This doesn't seem to work properly for now; something is weird with stdout/stderr, I'll investigate a bit further.

/etc/init/thermos-observer.conf:

#
# This properly starts and stops a docker container through upstart.
#
description "run thermos-observer"
respawn

script
  docker pull docker-repo.service.consul:5000/trusty/mesos
  . /etc/default/cluster
  docker create -i -t --name=$UPSTART_JOB --net=host \
    -v /var/run/thermos:/var/run/thermos \
    docker-repo.service.consul:5000/trusty/mesos \
    /usr/local/bin/thermos_observer.pex --log_dir=/data/log/thermos-observer --log_simple
  exec docker start -a $UPSTART_JOB
end script

post-stop script
  docker stop -t 0 $UPSTART_JOB
  docker rm $UPSTART_JOB
end script
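Once both jobs are in place, a few quick sanity checks I'd suggest (a sketch; the container names follow from the $UPSTART_JOB naming in the configs above):

# Start both jobs and confirm the containers are up with host networking:
initctl start mesos-slave
initctl start thermos-observer
docker ps

# The slave's logs should show it registering with the master
# (glog also writes under --log_dir, /var/lib/mesos/log, inside the container):
docker logs consul_mesos-slave

# The shared checkpoint root must be visible from the host, since every
# task container mounts the same path:
ls /var/run/thermos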
On Fri, Mar 13, 2015 at 6:18 PM, Steve Niemitz <st...@tellapart.com> wrote:

> I just noticed this line in your previous email:
> "Everything's included in the image, that's the image i'm running
> mesos-master, aurora, and mesos-slave from."
>
> Are you saying you're running the slave inside a docker container? If so,
> mesos does not support running docker containers from inside another
> docker container.
>
> On Fri, Mar 13, 2015 at 1:04 PM, Oskar Stenman <oskar.sten...@magine.com> wrote:
>
> > Thanks for the reply!
> >
> > The only files I find in the workdir are stderr, stdout and
> > thermos_executor.pex.
> > Nothing is cleaning the tmp directory; everything (except /data) is
> > running on ramdisk on these machines. It's the absolute minimal ubuntu
> > "minbase" (debootstrapped) installation I've found possible that is
> > still generally usable (around 100MB with docker and some other tools
> > I include).
> >
> > root@s1:~# ls /tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781 -l
> > total 29000
> > -rw-r--r-x 1 root root      519 Mar 13 13:39 stderr
> > -rw-r--r-x 1 root root        0 Mar 13 13:39 stdout
> > -rwxr-xr-x 1 root root 29690639 Mar 13 13:39 thermos_executor.pex
> >
> > STDERR:
> > root@s1:~# cat /tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781/stderr
> > WARNING: Logging before InitGoogleLogging() is written to STDERR
> > I0313 13:39:45.608578   258 fetcher.cpp:76] Fetching URI '/usr/local/bin/thermos_executor.pex'
> > I0313 13:39:45.609048   258 fetcher.cpp:179] Copying resource from '/usr/local/bin/thermos_executor.pex' to '/tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781'
> >
> > /Oskar
> >
> > On Fri, Mar 13, 2015 at 5:54 PM, Bill Farner <wfar...@apache.org> wrote:
> >
> > > When the task launches, a sandbox directory is created; in the above
> > > e-mail it was
> > > /tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781.
> > > Is there anything else in that directory? Please post any logs you
> > > find in there as they may provide useful clues.
> > >
> > > Possibly unrelated: your slave work directory is /tmp. I have observed
> > > that in some environments a temp cleaner process runs that
> > > automatically deletes things under here. This may not fix the issue at
> > > hand, but I suggest you move this outside /tmp as it is critical state
> > > for the slave.
> > >
> > > -=Bill
> > >
> > > On Fri, Mar 13, 2015 at 9:24 AM, Oskar Stenman <oskar.sten...@magine.com> wrote:
> > >
> > > > I wasn't subscribed to the mailing-list (I'm subscribed now though),
> > > > so I'm sorry if this reply ends up in the wrong place..
> > > >
> > > > "From the mesos slave log it looks like the executor is failing.
> > > > Most likely the issue is your image doesn't have the native
> > > > libraries needed to run it. The next step would be to look in the
> > > > sandbox for a failed run (you can find the path in the slave logs)
> > > > and look at the stderr log for errors."
> > > >
> > > > Everything's included in the image, that's the image I'm running
> > > > mesos-master, aurora, and mesos-slave from.
> > > >
> > > > Stdout was included in the last email:
> > > >
> > > > WARNING: Logging before InitGoogleLogging() is written to STDERR
> > > > I0313 13:39:45.608578   258 fetcher.cpp:76] Fetching URI '/usr/local/bin/thermos_executor.pex'
> > > > I0313 13:39:45.609048   258 fetcher.cpp:179] Copying resource from '/usr/local/bin/thermos_executor.pex' to '/tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781'
> > > >
> > > > Also, when relaunching the container manually:
> > > >
> > > > root@s1:/var/log/upstart# docker start -i mesos-cdf5e59f-c2be-47ba-b30a-2a690657e248
> > > > twitter.common.app debug: Initializing: twitter.common.log (Logging subsystem.)
> > > > Writing log files to disk in /mnt/mesos/sandbox
> > > > I0313 14:50:36.273181     5 exec.cpp:132] Version: 0.21.1
> > > > I0313 14:50:36.278043    29 exec.cpp:379] Executor asked to shutdown
> > > > Killed
> > > >
> > > > It seems really tricky to troubleshoot from my point of view as I
> > > > don't have any output at all.
> > > > Is the executor giving up immediately? Is it even trying to connect
> > > > to the slave?
> > > > Are parameters to the executor missing? (like job-config or something)
> > > > Does the directory not contain what it wants?
> > > > Is the network config wrong?
> > > >
> > > > /Oskar
> > > >
> > > > On Fri, Mar 13, 2015 at 3:08 PM, Oskar Stenman <oskar.sten...@magine.com> wrote:
> > > >
> > > > > Hi!
> > > > >
> > > > > I'm investigating aurora + mesos + docker and I'm stuck.
> > > > >
> > > > > I can create the hello world docker-task in aurora; it gets
> > > > > assigned a slave and the docker-container is launched, but the
> > > > > executor immediately terminates and it ends up in "task lost"
> > > > > state.
> > > > >
> > > > > Can anyone make any sense of this or tell me how to troubleshoot
> > > > > further?