Hi -

I've been wondering about using Hurd on a cluster computer; i.e., a
configuration where each node has multiple identical cores and its own
memory.  For example, an eight-node cluster where each node has 8 GB of RAM
and eight cores.  I stress that the cores are identical, so that processes
can run on any core and even migrate between them.

I'd like to see the whole thing running an integrated POSIX operating
system.  So, when I run 'top', I see 64 processors and 64 GB of RAM.

Code tuned for this architecture might behave like this:  A program forks
eight processes, and each process spawns eight threads.  Our basic
programming paradigm is that threads share memory and run on the same node,
while processes do not share memory and likely run on different nodes.  We
can set up shared memory between processes (System V IPC), but we know that
this is expensive because it has to be emulated, so we try to avoid it.
Process migration is expensive, too, so we try to avoid it as well.  In the
example, we migrate seven of our eight processes right at the beginning,
during program initialization - we take the performance hit once, then
leave them to run on their separate nodes.
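
A minimal sketch of that paradigm in plain C - just fork() plus pthreads,
with the worker function and the counts as placeholders:

    #include <pthread.h>
    #include <stdio.h>
    #include <sys/wait.h>
    #include <unistd.h>

    #define NPROCS   8   /* ideally one process per node */
    #define NTHREADS 8   /* one thread per core within a node */

    /* Placeholder worker: threads within a process share its address space. */
    static void *worker(void *arg)
    {
        printf("pid %d, thread %ld\n", (int) getpid(), (long) arg);
        return NULL;
    }

    int main(void)
    {
        for (int p = 0; p < NPROCS; p++) {
            if (fork() == 0) {
                /* Child: ideally migrated to its own node once, at startup,
                   then left there.  Its threads share memory locally. */
                pthread_t t[NTHREADS];
                for (long i = 0; i < NTHREADS; i++)
                    pthread_create(&t[i], NULL, worker, (void *) i);
                for (int i = 0; i < NTHREADS; i++)
                    pthread_join(t[i], NULL);
                _exit(0);
            }
            /* Parent shares no memory with the children unless we set up
               System V IPC (shmget/shmat), which we try to avoid. */
        }
        for (int p = 0; p < NPROCS; p++)
            wait(NULL);
        return 0;
    }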

All we really have between the nodes is a fast LAN, so we can do message
passing.  Yet that's exactly what Mach/Hurd is designed for - a kernel
built around message passing.
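
For reference, the primitive in question is just mach_msg().  A header-only
send to a port we hold a send right for looks roughly like this (error
handling omitted; header names vary slightly between Mach variants):

    #include <mach.h>      /* GNU/Hurd; other Mach systems use <mach/mach.h> */
    #include <string.h>

    /* Send an empty (header-only) message to dest.  On a cluster, this same
       call would have to keep working when dest's receive right lives in a
       Mach kernel on another node. */
    kern_return_t send_ping(mach_port_t dest)
    {
        mach_msg_header_t msg;

        memset(&msg, 0, sizeof msg);
        msg.msgh_bits        = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0);
        msg.msgh_size        = sizeof msg;
        msg.msgh_remote_port = dest;          /* a send right we already hold */
        msg.msgh_local_port  = MACH_PORT_NULL;
        msg.msgh_id          = 42;            /* arbitrary message id */

        return mach_msg(&msg, MACH_SEND_MSG, sizeof msg, 0,
                        MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    }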

Can Hurd work, well, in such an environment?

Since I haven't dug into a single line of Hurd, I don't know - that's why
I'm asking.  I've done some homework, though, and there are some things
that I am aware of.

First, it's basically Mach that would have to be modified, right?  Changes
to Hurd servers might be required for performance reasons, but so long as
Mach works on the cluster, Hurd should work.

Next, Mach/Hurd's memory limitations and 32-bit pointers.  My first thought
was to ignore them for now, since these are well-known problems.  If we
could get Hurd running at all on a cluster computer, then we'd have to come
back and make sure it can actually use the entire 8 GB of RAM on a single
node - a 32-bit pointer only covers 4 GB, never mind 64 GB.  Yet I'm not
sure.  There might be situations where we have to address the entire
cluster's RAM, even though accessing a non-local part of it will be slow.

Sending large blocks of data in Mach messages becomes problematic, since we
can't play shared memory games.  It would have to be emulated, and avoided
whenever possible.  These are the kinds of changes that would be needed to
the Hurd servers themselves - they can no longer assume that firing virtual
memory across a port is fast.
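
To make that concrete, here's roughly what handing a large buffer to another
task looks like with Mach's typed IPC - locally the kernel can satisfy this
by remapping the pages copy-on-write, but between nodes that same region
would have to be shipped over the LAN.  I'm writing the descriptor fields
from my reading of the Mach 3 interface, so treat the details as approximate:

    #include <mach.h>
    #include <string.h>

    /* Sketch only: send `size` bytes at `buf` out-of-line to `dest`. */
    kern_return_t send_big_buffer(mach_port_t dest, vm_address_t buf,
                                  vm_size_t size)
    {
        struct {
            mach_msg_header_t    head;
            mach_msg_type_long_t type;   /* long-form type descriptor */
            vm_address_t         data;   /* a pointer, not the data itself */
        } msg;

        memset(&msg, 0, sizeof msg);
        msg.head.msgh_bits = MACH_MSGH_BITS(MACH_MSG_TYPE_COPY_SEND, 0)
                             | MACH_MSGH_BITS_COMPLEX;   /* carries OOL memory */
        msg.head.msgh_size        = sizeof msg;
        msg.head.msgh_remote_port = dest;
        msg.head.msgh_id          = 2000;                /* arbitrary */

        msg.type.msgtl_header.msgt_inline   = FALSE;     /* out-of-line */
        msg.type.msgtl_header.msgt_longform = TRUE;
        msg.type.msgtl_name   = MACH_MSG_TYPE_BYTE;
        msg.type.msgtl_size   = 8;
        msg.type.msgtl_number = size;

        msg.data = buf;   /* locally: pages get remapped into the receiver */

        return mach_msg(&msg.head, MACH_SEND_MSG, sizeof msg, 0,
                        MACH_PORT_NULL, MACH_MSG_TIMEOUT_NONE, MACH_PORT_NULL);
    }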

In-order and guaranteed delivery.  For the moment, let's assume that our
LAN can do this natively.  Since we're not going through routers, only a
single Ethernet switch, maybe virtualized, this might work.

Can a Hurd network driver be built to pass kernel messages, or is this a
huge problem?  Something like, you load an Ethernet driver, and it has some
kind of interface that allows Mach messages to be passed through it?

Protected data, like port rights - let's assume that we use a dedicated
Ethertype that isn't routed and can't be addressed by anything but trusted
Mach kernels.  Yes, this means that our Ethernet driver now becomes a
potential security hole that can be used to steal port rights, but let's
keep noting and then ignoring stuff like that...
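
Just to make the idea concrete, a hypothetical on-the-wire layout - the
EtherType value and the struct are made up for illustration, not any
existing protocol:

    #include <stdint.h>

    /* Hypothetical, made-up EtherType for node-to-node Mach IPC; nothing
       like this exists today.  0x88B5 is one of the IEEE "local
       experimental" EtherTypes, so a switch would never route it off the
       cluster LAN. */
    #define ETHERTYPE_MACH_IPC  0x88B5

    /* Illustrative frame: an Ethernet header followed by the serialized
       Mach message, with port rights translated into some node-global
       naming scheme by the sending kernel before transmission. */
    struct mach_ipc_frame {
        uint8_t  dst_mac[6];
        uint8_t  src_mac[6];
        uint16_t ethertype;     /* ETHERTYPE_MACH_IPC, network byte order */
        uint16_t src_node;      /* hypothetical node ids, not real Mach fields */
        uint16_t dst_node;
        uint32_t msg_size;      /* size of the Mach message that follows */
        /* followed by msg_size bytes: header with translated port names, body */
    } __attribute__((packed));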

And, oh yes, a "Mach kernel" is now something that runs across multiple
processors with no shared memory.  This is the biggest problem that I can
see - Mach is multithreaded, so that's not a problem, but I'll bet it
assumes shared memory structures between the threads, and that's pervasive
in all its code.  Am I right?
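
By "shared memory structures" I mean the usual kernel pattern of a lock
around a table that any thread, on any processor, touches directly - a
generic illustration (not actual Mach code):

    #include <pthread.h>

    /* Generic illustration, not Mach source: a global list protected by a
       lock, read and written directly by every thread. */
    struct entry {
        int           id;
        struct entry *next;
    };

    static struct entry    *table;                         /* shared by all */
    static pthread_mutex_t  table_lock = PTHREAD_MUTEX_INITIALIZER;

    static void table_insert(struct entry *e)
    {
        pthread_mutex_lock(&table_lock);   /* assumes the lock word and the   */
        e->next = table;                   /* list both live in memory that   */
        table = e;                         /* every processor can see cheaply */
        pthread_mutex_unlock(&table_lock);
    }

Split across nodes with no shared memory, every access like that either
becomes a remote operation or the structure has to be partitioned and kept
coherent by messages.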

If so, then the first step would be to modify Mach, probably throughout its
code, so that it can handle threads with no shared memory between them,
only a communication interface provided by a network driver.  That gets it
running on a cluster; then we need to remove the memory limitations and
start tuning things to make it run well.

The payoff is a supercomputer operating system that presents an entire
cluster as a single POSIX system with hundreds of processors, terabytes of
RAM, and petabytes of disk space.

Any thoughts?

Any prior work in this direction?

Thank you!

    agape
    brent
