[OMPI users] Docker Cluster Queue Manager

2016-06-02 Thread Rob Nagler
We would like to use MPI on Docker with arbitrarily configured clusters
(e.g. created with StarCluster or bare metal). What I'm curious about is if
there is a queue manager that understands Docker, file systems, MPI, and
OpenAuth. JupyterHub does a lot of this, but it doesn't interface with MPI.
Ideally, we'd like users to be able to queue up jobs directly from
JupyterHub.

Currently, we can configure and initiate an MPI-compatible Docker cluster
running on a VPC using Salt. What's missing is the ability to manage a
queue of these clusters. Here's a list of requirements:


   - JupyterHub users do not have Unix user ids
   - Containers must be started as a non-root guest user (--user)
   - JupyterHub user's data directory is mounted in container
   - Data is shared via NFS or other cluster file system
   - sshd runs in container for MPI as guest user
   - Results have to be reported back to GitHub user
   - MPI network must be visible (--net=host)
   - Queue manager must be compatible with the above
   - JupyterHub user is not allowed to interact with Docker directly
   - Docker images are user selectable (from an approved list)
   - Jupyter and MPI containers started from same image
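As a sketch of what starting one such worker container might look like, here is a small helper that builds the `docker run` argv from the constraints above. The image name, uid, and NFS path are hypothetical, and sshd running as a non-root user would additionally need an unprivileged port and per-user host keys:

```python
# Build a "docker run" invocation satisfying the constraints listed
# above. All concrete names (image, uid, paths) are made up.

def worker_run_cmd(image, user_dir, uid=1000, gid=1000):
    """Argv for starting one MPI worker container."""
    return [
        "docker", "run", "-d",
        "--user", f"{uid}:{gid}",               # non-root guest user
        "--net=host",                            # MPI network must be visible
        "-v", f"{user_dir}:/home/docker-user",   # user's data directory
        image,
        "/usr/sbin/sshd", "-D",                  # sshd for MPI, as the guest user
    ]

cmd = worker_run_cmd("radiasoft/beamsim", "/var/nfs/jupyter/alice")
```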

Know of a system which supports this?

Our code and config are open source, and your feedback would be greatly
appreciated.

Salt configuration: https://github.com/radiasoft/salt-conf
Container builders:
https://github.com/radiasoft/containers/tree/master/radiasoft
Early phase wiki: https://github.com/radiasoft/devops/wiki/DockerMPI

Thanks,
Rob


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-02 Thread Rob Nagler
Thanks, Ralph. I'm not sure I explained the problem clearly. Salt and
JupyterHub are distractions, sorry.

I have code which "wires up" a cluster for MPI. What I need is a scheduler
that allows users to:

* Select which Docker image they'd like to wire up
* Request a number of nodes/cores

and that itself:

* Understands that clusters can be dynamically created
* Invokes an external command to create the cluster

Here's what I'd like the user to be able to do:

$ queue-job --image=radiasoft/beamsim --cores=5000 my-script.sh

queue-job would then have to be able to call a 3rd-party module to get the
user:

# 3rd-party-environ-collector

This command would return a logical user and a network-accessible
directory. This info would be added to the queue, and then when the
scheduler decided to start the job, it would call:

# 3rd-party-start --image=radiasoft/beamsim --cores=5000 --user=robnagler \
#     --mount=nfs://intra.server.com/var/nfs/bla/robnagler/foo/bar:/home/docker-user/run \
#     my-script.sh
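The glue between those two hypothetical commands might look roughly like this; the queue is stubbed as a list and environ-collector's answer is hard-coded:

```python
# Rough sketch of the queue-job flow above. The 3rd-party command names
# are the hypothetical ones from the examples.
QUEUE = []

def environ_collector():
    # Would shell out to "3rd-party-environ-collector"; stubbed here.
    return {"user": "robnagler",
            "mount": "nfs://intra.server.com/var/nfs/bla/robnagler/foo/bar"}

def queue_job(image, cores, script):
    QUEUE.append((image, cores, environ_collector(), script))

def start_next():
    # Called when the scheduler decides the job can run.
    image, cores, env, script = QUEUE.pop(0)
    return ["3rd-party-start",
            f"--image={image}",
            f"--cores={cores}",
            f"--user={env['user']}",
            f"--mount={env['mount']}:/home/docker-user/run",
            script]

queue_job("radiasoft/beamsim", 5000, "my-script.sh")
cmd = start_next()
```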

The bit that I need is the scheduling. For example, the scheduler would
have to give the user a maximum number of core hours. It might also give the
job a unique group id (a la Grid Engine) to manage disk quotas. These
constraints would need to be passed to the 3rd-party programs so they could
constrain the Docker container.
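One of those constraints, the per-user core-hour budget, is simple to sketch. The numbers and the in-memory accounting store are made up; a real scheduler would persist usage:

```python
# Toy sketch of the per-user core-hour budget check a scheduler would
# perform before starting a job.
BUDGET = {"robnagler": 100_000}   # allotted core-hours per logical user
USED = {"robnagler": 96_000}      # core-hours consumed so far

def may_start(user, cores, est_hours):
    """True if the job fits in the user's remaining core-hour budget."""
    remaining = BUDGET.get(user, 0) - USED.get(user, 0)
    return cores * est_hours <= remaining

ok = may_start("robnagler", 5000, 0.5)   # 2500 core-hours requested
```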

What I have now is 3rd-party-environ-collector (JupyterHub) and
3rd-party-start (Salt). What I need is a scheduler that has an architecture
that supports dynamic clusters and users who have no local credentials
(non-Unix user -- just a name and a home directory).

TIA,
Rob


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-03 Thread Rob Nagler
Hi John,

Thanks for your thoughts. Lots of new technology out there!

> have you looked at Singularity
> https://github.com/gmkurtzer/singularity/releases/tag/2.0
>

Looks very cool, but it doesn't address our problem. We already have the
images built with our codes. Our users don't have Unix user ids. We know
their execution environment. And it doesn't handle queueing, which is the
problem I have.


> I was gobsmacked to see how easy it was to install Julia ClusterManagers
> and get Slurm integration.
>

This is very nice, and something many Docker-oriented tools have. What they
don't have is good multi-user support. Remember, you can't let people run
Docker directly, because it gives them root access to the machine. The
queue manager has to control that part. You don't even want to start the
container as root, because you might be running an arbitrary container.


> ps. Also have you looked at Bright Cluster Manager?
> http://www.brightcomputing.com/whats-new-in-7.2
>

We want both bare metal and commercial VPCs. Provisioning bare metal is not
a problem we have right now. Our cluster is small and already provisioned.
For VPCs, we can use StarCluster to launch the cluster in the cloud, but
that cluster is standalone. The queue manager needs to know it was created
and push the user's environment to it.

The interesting times we are living in are at odds with our
infrastructure-oriented past. Clusters can come and go, and users can
package their code portably. The "module load" systems like the one Bright
Cluster offers are irrelevant. Let users build their images as they like with only
a few requirements, and they can run them with JupyterHub AND in an HPC
environment, which eliminates the need for Singularity.

Rob


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-03 Thread Rob Nagler
Hi John,

> What is the use case here - are you just wanting the codes to execute
> with one given Unix ID?

Are you familiar with wakari.io? That's an example of what we want to do,
but with the ability to start jobs. Rescale.com is another example of a
web-based job submission mechanism.

JupyterHub is a fine front-end for what we want to do. All we need is a
qsub that is decoupled from Unix user ids and allows for the creation of
clusters dynamically.

Rob


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-04 Thread Rob Nagler
Hi Daniel,

Thanks.

Shifter is also interesting. However, it assumes our users map to a Unix
user id, and therefore the access to the shared file system can be
controlled by normal Unix permissions. That's not scalable, and makes for
quite a bit of complexity. Each node must know about each user so you have
to run LDAP or something similar. This adds complexity to dynamic cluster
creation.

Shifter runs in a chroot context, not a cgroup one. For a supercomputer
center with an application process to get an account, this works fine. For
a web application with no "background check", it's more risky. At NERSC,
you don't have the bad actor problem. Web apps do, and all it takes is one
local exploit to escape chroot. Docker/cgroups is safer, and the focus on
improving Linux security is on cgroups these days, not chroot "jails".

Shifter also does not solve the problem of queuing dynamic clusters.
SLURM/Torque, which Shifter relies on, does not either. This is probably
the most difficult item. StarCluster does solve this problem, but doesn't
work on bare metal, and it's not clear if it is being maintained any more.

Rob


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-04 Thread Rob Nagler
Thanks! SLURM Elastic Computing seems like it might do the trick. I need to
try it out.

xCAT is interesting, too. It seems to be the HPC version of Salt'ed
Cobbler. :)  I don't know that it's so important for our problem. We have a
small cluster for testing against the cloud, primarily. I could see xCAT
being quite powerful for large clusters.

I'm not sure how to explain the Unix user id problem other than a gmail
account does not have a corresponding Unix user id. Nor do you have one for
your representation on this mailing list. That decoupling is important. The
actual execution of Unix processes on behalf of users of gmail, this
mailing list, etc. runs as a single Unix user. That's how JupyterHub
containers run. When you click "Start Server" in JupyterHub, it starts a
docker container as some system user (uid=1000 in our case), and the
container is given access to the user's files via a Docker volume. The
container cannot see any other user's files.

In a typical HPC context, the files are all in /home/. The
"containment" is done by normal Unix file permissions. It's very easy, but
it doesn't work for web apps as described above. Even being able to list
all the other users on a system (via "ls /home") is a privacy breach in a
web app.

Rob


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-06 Thread Rob Nagler
Thanks, John. I sometimes wonder if I'm the only one out there with this
particular problem.

Ralph, thanks for sticking with me. :) Using a pool of uids doesn't really
work due to the way cgroups/containers work. It also would require
changing the permissions of all of the user's files, which would create
issues for Jupyter/Hub's access to the files, which is used for in situ
monitoring.

Docker does not yet handle uid mapping at the container level (1.10 added
mappings for the daemon). We have solved this problem by adding a uid/gid
switcher at container startup for our images. The trick is to change the
uid/gid of the "container user" with usermod and groupmod. This only works,
however, with images we provide. I'd like a solution that allows us to
start arbitrary/unsafe images, relying on cgroups to do their job.
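That uid/gid switch can be sketched as the commands an entrypoint would run. This assumes the image ships a non-root account named "docker-user" (our convention, not a Docker requirement); the real entrypoint would execute each command as root, e.g. with subprocess.run(cmd, check=True), then drop to the remapped user:

```python
# Commands that remap the container user's uid/gid to match the uid/gid
# that owns the mounted volume. "docker-user" is our baked-in account.

def switch_cmds(target_uid, target_gid, user="docker-user"):
    return [
        ["groupmod", "-g", str(target_gid), user],
        ["usermod", "-u", str(target_uid), "-g", str(target_gid), user],
        # chown the home directory so the remapped user still owns it
        ["chown", "-R", f"{target_uid}:{target_gid}", f"/home/{user}"],
    ]
```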

Gilles, the containers do lock the user down, but the problem is that the
file system space has to be dynamically bound to the containers across the
cluster. JupyterHub solves this problem by understanding the concept of a
user, and providing a hook to change the directory to be mounted.

Daniel, we've had bad experiences with ZoL. Its allocation algorithm
degrades rapidly when the file system gets over 80% full. It still is not
integrated into major distros, which leads to dkms nightmares on system
upgrades. I don't really see Flocker as helping in this regard, because the
problem is the scheduler, not the file system. We know which directory we
have to mount from the cluster file system; we just need to get the scheduler
to allow us to mount it with the container that is running slurmd.

I'll play with Slurm Elastic Compute this week to see how it works.

Rob


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-06 Thread Rob Nagler
Ralph,

> FWIW: I haven’t seen it before.
>

Good to know.


>
> Not sure I understand the issue, but I have no knowledge of Jupyter or why
> you are using it. From what I can see, it appears that your choice of tools
> may be complicating your solution - I’d suggest perhaps focusing on solving
> the problem rather than trying to force-fit your current tools, but that
> presumes you don’t have some particular attachment to those tools.
>

No attachment to particular tools. This is a greenfield project. As to
Jupyter, it's fairly simple. JupyterHub launches Jupyter in a Docker
container with the user's directory, e.g. /var/nfs/jupyter/robnagler gets
mounted at /home/docker-user. Then the user can bring up a terminal and work
within the container as if they ssh'ed into the container, but without
having to run ssh.

Think of JupyterHub as a way to provide login nodes without having a
corresponding Unix user.



>
> That isn’t the security hole - the issue is that Docker doesn’t prevent
> the user from taking privileged state, which means the user can become
> root. Yes, it is within that container - but the network and other 3rd
> party services can be vulnerable. Cgroups doesn’t really solve that problem
> as it still thinks the user is the one you originally set for the
> container, and constrains resources that way - but it doesn’t do
> authentication protection.
>

I think user namespaces (Docker 1.10) help mitigate privilege
escalation/container escape issues.
You can become root in the container, but outside the container, you are
some other user, like NFS's root_squash. Docker by default does contain the
network and other devices. I don't really know what MPI would require, but
it seems to work with TCP sockets, which don't allow spoofing.
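For reference, daemon-level remapping is enabled with a one-line daemon.json entry (this remaps all containers on the host, which is the 1.10 feature mentioned; "default" tells the daemon to create and use a dockremap user):

```json
{
  "userns-remap": "default"
}
```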

One thing I'm assuming with all this is that people who run containers on
Docker Hub, Travis, Terminal.com, etc. have similar problems. They are
running jobs on behalf of web users (like our problem) as a single unix
user id. Docker Hub runs containers as root to build images so they must be
able to lock down containers well enough.

Another thing we can (and probably should) do is verify the images have no
setuid files. I think this would eliminate a lot of the privilege
escalation issues.
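A setuid scan like that could be as small as the following; you would point it at an image's extracted root filesystem rather than a live host:

```python
# Walk a directory tree and report files with the setuid or setgid bit
# set. Intended to run over an image's extracted filesystem.
import os
import stat

def find_setuid(root):
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mode = os.lstat(path).st_mode
            except OSError:
                continue  # broken symlink, permission error, etc.
            if mode & (stat.S_ISUID | stat.S_ISGID):
                hits.append(path)
    return hits
```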

Rob


Re: [OMPI users] Docker Cluster Queue Manager

2016-06-22 Thread Rob Nagler
Good morning, Dave,

> Amongst reasons for not running Docker, a major one that I didn't notice
> raised is that containers are not started by the resource manager, but
> by a privileged daemon, so the resource manager can't directly control
> or monitor them.
>

There's an endless debate about this between the docker and systemd folks.
It is possible to get at the underlying process if a resource manager
wanted to.


> From a brief look at Jupyter when it came up a while ago, I wouldn't
> want to run it, and I wasn't alone.  (I've been lectured about the lack
> of problems with such things by people on whose clusters I could
> trivially run jobs as any normal user and sometimes as root.)
>

Well, some people disagree, e.g. ipython.nersc.gov. Our users like Jupyter.
It's my job to help them use it.

> +1 for what Ralph said about singularity in particular.  While there's
> work to be done, you could even convert docker images on the fly in a
> resource manager prolog.  I'm awaiting enlightenment on the on-topic
> issue of running MPI jobs with it, though.
>
>
I don't see how Singularity addresses the problem of starting MPI inside
Docker.

In any event, our current plan is to bypass resource managers completely
and start an AWS fleet per user request. The code is much simpler for
everybody.

Rob