Most of the resolution to this happened on IRC (#aurora on freenode),
thanks to Steve :)
I'm including it here "for logging" / searchability.

We're running all of mesos, aurora, thermos-observer, zookeeper and so on in
docker containers. Having mesos-slave run in a container while it manages
other containers on the same layer of "containerization" puts special
requirements on the mesos-slave container.
A simple schematic showing how things are layered:
(host -> docker
  docker -> mesos-slave
  docker -> mesos-task
  docker -> ...
)

This requires that the directories mesos-slave uses are on the same path in
both the slave container and on the host, since mesos-slave mounts those
directories from docker's (i.e. the host's) perspective into the mesos-task
containers.
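
To make that concrete, here's a minimal sketch (the image name is just a
placeholder, not our actual image):

# docker resolves -v source paths on the HOST, not inside the slave
# container. When mesos-slave asks docker to bind a task sandbox like
# /tmp/mesos/slaves/.../runs/<id> into a task container, that exact path
# has to exist on the host too.
docker run -v /tmp/mesos:/tmp/mesos some/mesos-slave-image   # ok: same path
docker run -v /data/mesos:/tmp/mesos some/mesos-slave-image  # breaks: the
# slave hands docker /tmp/mesos/... paths, but the host keeps them under
# /data/mesos, so docker mounts the wrong (or a missing) directory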

The mesos-slave also seems to do its own monitoring of the aurora
executor's pid.
If mesos-slave can't find that pid it will assume the task was lost.
(This was the main issue I was banging my head against, since this failure
isn't logged anywhere; the slave just prints TASK_LOST and the executor
just exits..)
This happens if mesos-slave is running in a docker container, where it sits
in an isolated pid namespace and therefore cannot see the pid of the
executor it launched.
By running the docker container containing mesos-slave with "--pid=host",
mesos-slave shares the host's pid namespace and can find the
aurora-executor's pid, and things will magically work.. :>
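
An easy way to see the pid-namespace difference for yourself (generic docker
commands, any image with ps installed will do):

# Without --pid=host the container gets an isolated pid namespace: pid 1 is
# the container's own entrypoint and host processes are invisible.
docker run --rm ubuntu:trusty ps ax

# With --pid=host the container shares the host's pid namespace, so processes
# started outside it (like the executor) are visible and findable by pid.
docker run --rm --pid=host ubuntu:trusty ps ax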

For this setup (mesos-slave in a docker container, tasks in other containers,
and so on), mesos-slave, thermos_observer and all mesos tasks need to be able
to access the same directories:

- /sys needs to be mounted into the mesos-slave container's /sys so it can
  manage cgroups.
- /var/run/thermos needs to be mounted into thermos_observer's
  /var/run/thermos. This also gets mounted into every task's /var/run/thermos;
  I'm assuming this is so all tasks can report back to thermos_observer
  whether they're running or not.
- /tmp/mesos (or whatever directory mesos-slave is configured to keep task
  sandboxes in) needs to be mounted from the host's /tmp/mesos into the
  mesos-slave container's /tmp/mesos; mesos-slave then mounts it from the
  host into each mesos-task container.
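
A quick way to verify the mounts actually line up (my own sanity check; the
container name here is a placeholder):

# The same directory on both sides of a bind mount has the same inode number,
# so compare the host's view with the view from inside the slave container:
stat -c %i /tmp/mesos
docker exec <mesos-slave-container> stat -c %i /tmp/mesos
# identical numbers => the task sandboxes really are shared with the host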

Mesos-slave docker config:
/etc/init/mesos-slave.conf
#
# This properly starts and stops a docker container through upstart.
#

description     "run mesos-slave"

respawn

script
docker pull docker-repo.service.consul:5000/trusty/mesos
. /etc/default/cluster
mkdir -p /tmp/mesos

docker create -i -t --name=consul_$UPSTART_JOB --net=host \
-v /run/thermos:/run/thermos \
-v /tmp/mesos:/tmp/mesos \
-v /sys:/sys \
-v /usr/bin/docker:/usr/bin/docker \
-v /var/run/docker.sock:/var/run/docker.sock \
-e ULIMIT="-n 8192" \
-e MESOS_resources="ports:[13000-15999]" \
-e MESOS_FRAMEWORKS_HOME=/data/services/mesos/frameworks \
-e MESOS_launcher_dir=/usr/libexec/mesos \
-e MESOS_attributes="role:slave;rack:$RACK;host:`hostname`" \
--pid=host \
docker-repo.service.consul:5000/trusty/mesos \
/usr/sbin/mesos-slave \
--master=zk://zookeeper.service.consul:2181/mesos/$CLUSTER \
--containerizers=docker,mesos --log_dir=/var/lib/mesos/log \
--executor_registration_timeout=5mins

exec docker start -a consul_$UPSTART_JOB
end script

post-stop script
docker stop -t 0 consul_$UPSTART_JOB
docker rm consul_$UPSTART_JOB
sleep 5
end script
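
To sanity-check the job once the config is in place (standard upstart and
docker commands; $UPSTART_JOB expands to the job name, so the container ends
up named consul_mesos-slave):

start mesos-slave
status mesos-slave                # should report start/running with a pid
docker logs consul_mesos-slave   # slave output, since we start with -a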


This one doesn't seem to work properly yet; something is weird with
stdout/stderr, I'll investigate a bit further.
/etc/init/thermos-observer.conf:
#
# This properly starts and stops a docker container through upstart.
#

description     "run thermos-observer"

respawn

script
docker pull docker-repo.service.consul:5000/trusty/mesos
. /etc/default/cluster
docker create -i -t --name=$UPSTART_JOB --net=host \
-v /var/run/thermos:/var/run/thermos \
docker-repo.service.consul:5000/trusty/mesos \
/usr/local/bin/thermos_observer.pex --log_dir=/data/log/thermos-observer \
--log_simple
exec docker start -a $UPSTART_JOB
end script

post-stop script
docker stop -t 0 $UPSTART_JOB
docker rm $UPSTART_JOB
end script
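
If the observer comes up correctly, its web UI should answer on its port
(1338 is the default as far as I know; adjust if you pass --port):

curl -s http://localhost:1338/ | head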


On Fri, Mar 13, 2015 at 6:18 PM, Steve Niemitz <st...@tellapart.com> wrote:

> I just noticed this line in your previous email:
> "Everything's included in the image, that's the image i'm running
> mesos-master,
> aurora, and mesos-slave from."
>
> Are you saying you're running the slave inside a docker container?  If so,
> mesos does not support running docker containers from inside another docker
> container.
>
> On Fri, Mar 13, 2015 at 1:04 PM, Oskar Stenman <oskar.sten...@magine.com>
> wrote:
>
> > Thanks for the reply!
> >
> > The only files i find in the workdir are stderr, stdout and
> > thermos_executor.pex.
> > Nothing is cleaning the tmp directory; everything (except /data) is
> > running on ramdisk on these machines. It's the absolute minimal ubuntu
> > "minbase" (debootstrapped) installation i've found possible and still
> > generally usable. (around 100MB with docker and some other tools i
> > include)
> >
> >
> > root@s1:~# ls -l
> > /tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781
> > total 29000
> > -rw-r--r-x 1 root root      519 Mar 13 13:39 stderr
> > -rw-r--r-x 1 root root        0 Mar 13 13:39 stdout
> > -rwxr-xr-x 1 root root 29690639 Mar 13 13:39 thermos_executor.pex
> >
> > STDERR:
> > root@s1:~# cat
> > /tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781/stderr
> > WARNING: Logging before InitGoogleLogging() is written to STDERR
> > I0313 13:39:45.608578   258 fetcher.cpp:76] Fetching URI
> > '/usr/local/bin/thermos_executor.pex'
> > I0313 13:39:45.609048   258 fetcher.cpp:179] Copying resource from
> > '/usr/local/bin/thermos_executor.pex' to
> > '/tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781'
> >
> > /Oskar
> >
> > On Fri, Mar 13, 2015 at 5:54 PM, Bill Farner <wfar...@apache.org> wrote:
> >
> > > When the task launches, a sandbox directory is created; in the above
> > > e-mail it was
> > > /tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781.
> > > Is there anything else in that directory?  Please post any logs you
> > > find in there as they may provide useful clues.
> > >
> > > Possibly unrelated: your slave work directory is /tmp.  I have observed
> > > that in some environments a temp cleaner process runs that automatically
> > > deletes things under here.  This may not fix the issue at hand here, but
> > > I suggest you move this outside /tmp as it is critical state for the
> > > slave.
> > >
> > >
> > > -=Bill
> > >
> > > On Fri, Mar 13, 2015 at 9:24 AM, Oskar Stenman <oskar.sten...@magine.com>
> > > wrote:
> > >
> > > > I wasn't subscribed to the mailing-list (I'm subscribed now though)
> > > > so i'm sorry if this reply ends up in the wrong place..
> > > >
> > > > From the mesos slave log it looks like the executor is failing. Most
> > > > likely the issue is your image doesn't have the native libraries
> > > > needed to run it.  The next step would be to look in the sandbox for
> > > > a failed run (you can find the path in the slave logs) and look at
> > > > the stderr log for errors.
> > > >
> > > > Everything's included in the image, that's the image i'm running
> > > > mesos-master, aurora, and mesos-slave from.
> > > >
> > > > Stderr was included in the last email:
> > > >
> > > > WARNING: Logging before InitGoogleLogging() is written to STDERR
> > > > I0313 13:39:45.608578   258 fetcher.cpp:76] Fetching URI
> > > > '/usr/local/bin/thermos_executor.pex'
> > > > I0313 13:39:45.609048   258 fetcher.cpp:179] Copying resource from
> > > > '/usr/local/bin/thermos_executor.pex' to
> > > > '/tmp/mesos/slaves/20150313-131712-1143806393-5050-6-S0/frameworks/20150306-112428-1177360825-5050-6-0000/executors/thermos-1426253925515-docker-test-devel-hello_docker-0-a9011a74-c2a2-4cb7-b402-d383fde58c41/runs/e53d1267-e341-46f5-9759-0361c7440781'
> > > >
> > > > Also, when relaunching the container manually:
> > > >
> > > > root@s1:/var/log/upstart# docker start -i
> > > > mesos-cdf5e59f-c2be-47ba-b30a-2a690657e248
> > > > twitter.common.app debug: Initializing: twitter.common.log (Logging
> > > > subsystem.)
> > > > Writing log files to disk in /mnt/mesos/sandbox
> > > > I0313 14:50:36.273181     5 exec.cpp:132] Version: 0.21.1
> > > > I0313 14:50:36.278043    29 exec.cpp:379] Executor asked to shutdown
> > > > Killed
> > > >
> > > >
> > > > It seems to be really tricky to troubleshoot from my point of view as
> > > > i don't have any output at all.
> > > > Is the executor giving up immediately? Is it even trying to connect
> > > > to the slave?
> > > > Parameters to the executor missing? (like job-config or something)
> > > > Directory doesn't contain what it wants?
> > > > Network config wrong?
> > > >
> > > > /Oskar
> > > >
> > > >
> > > > On Fri, Mar 13, 2015 at 3:08 PM, Oskar Stenman <oskar.sten...@magine.com>
> > > > wrote:
> > > >
> > > > > Hi!
> > > > >
> > > > > I'm investigating aurora + mesos + docker and i'm stuck.
> > > > >
> > > > > I can create the hello world docker-task in aurora, it gets
> > > > > assigned a slave, the docker-container is launched but the executor
> > > > > immediately terminates and it ends up in "task lost" state.
> > > > >
> > > > > Can anyone make any sense of this or tell me how to troubleshoot
> > > > > further?