Re: A niche for the Hurd

Jason Cozens Wed, 29 Oct 2008 21:16:43 -0700

Sergiu Ivanov wrote:
> Is it possible to run only *parts* of the Hurd on separate computers,
> that is, make the Hurd a kind of a network operating system. It most
> probably sounds crazy, but I know that only a microkernel OS can do
> something like that, so the Hurd may be appropriate :-)
> 
> Sorry for being late with the idea :-(
> 
> scolobb
>


Apologies too for being late in posting this. Also apologies if this is
the incorrect place for this.

I'm not sure if this is the correct thread to post it to but I was
thinking of posting it to threads which have been referred to in this
thread. It also a bit "blue sky" (brainstorm).

I've been working on a distributed scheduler for some time and after
reading the papers by Walfield and Brinkman (see below) I've been
wondering whether the work I've been doing could be used in the Hurd?
What I've detailed so far in the work is more of a resource allocator
than a scheduler.

Below I've pasted the contents of an email I sent recently to a contact
and I have written a a fairly detailed report.

The report is at:

/doc/reports/report01.txt

http://bazaar.launchpad.net/~jason-cozens/eqp/trunk/files

Best Wishes,

Jason Cozens.

-----------------------------------------------------------------------
        N.H. Walfield, M. Brinkman
        "Improving Usability via Access Decomposition and Policy
        Reﬁnement"
        2007.
        http://www.walfield.org/papers
        /20070104-walfield-access-decomposition-policy-refinement.pdf

        N.H. Walfield, M. Brinkman
        "A Critique of the GNU Hurd Multi-server Operating System"
        2007
        http://www.walfield.org/papers
            /200707-walfield-critique-of-the-GNU-Hurd.pdf
------------------------------------------------------------------------

Reliable Computing - Main Objectives
------------------------------------

The major objective that I am working on is how to make general purpose
operating systems more reliable. Reliability in this context is
concerned with the ability to build component based systems whose
emergent behaviours can be proved correct from formal models of the
components.

With reliability comes the ability to build larger systems more easily.
I am investigating whether hardware isolation of processes can improve
reliability. I have taken this hardware isolation to the limit and
require that each "process" runs on its own isolated processor.

By having each process run on its own processor the possibility for
proving process behaviour correct becomes more tractable. It becomes
easier to formally define the processor architecture and the environment
in which a process runs. From this the formal proof of correctness of a
process's behaviour may be possible. Added to this the architecture put
forward below uses a very simple protocol for process creation
scheduling and resource allocation whose behaviour including failures
can be proved correct using "A Simple Broadcast Calculus" which I'm
currently writing up.

It's not enough to design a new architecture and expect it to be
adopted. For the new architecture to find widespread use it must support
current software and ideally current software development paradigms.

The new architecture should implement a POSIX compliant API. The
intention is that each fork()/exec*() process creation will result with
a new process on its own processor. As time goes on the creation of
processes within code can be optimised to the new kernel.

To this end the architecture is based on microkernels such as MINIX 3.
MINIX 3 has been used initially for clean and simple structure. The
real target operating system is the GNU Hurd.

Though reliability is a laudable goal in its own right, performance is
also crucial. The design is based on that of microkernels and at present
these are known to have poor performance compared to monolithic kernels
such as linux due to issues such as a large number of context switches
and issues with message passing. The new architecture would have no
context switching and would have message passing as a primitive. Each
process would have its own processor with its own address space
communicating with other processes using concurrent messages. When a
processor had nothing to do it would drop into a low power state waiting
for an interrupt or other event to restart it.

The objectives in design in order of importance are:

        1. Reliability.
        2. Ease of Software Migration.
        3. Performance.

Outline of Work so Far - Introduction to ROCK and EQP
-----------------------------------------------------

This section is a high level overview of the work I've done so far.

The main result of the work I have been doing so far is a kernel I call
ROCK (Reliable Optimising Concurrent Kernel). The kernel is made up of
a network of worker processors where each worker processor is tightly
coupled to one or more QCells dedicated to the worker.

A basic assumption of most OS design that processors have to be kept
busy is questioned and instead it is assumed that there is an abundance
of very simple processors (1000+) and communication is cheap.

The worker processors (Processor 1, 2, etc.) communicate with each other
using messages over the inter-processor communication (IPC) channels.
This is fairly standard.

The QCells (QCell 1, 2, etc.) communicate with each other using
broadcast messages over the inter-qcell communication (IQC) channels.
Each Processor / QCell coupling communicates using a private channel.

The diagram below shows the case where each Processor has one QCell.


     IPC                                            IQC
     Channels                                      Channels
                 +-------------+    +---------+
           !/?---o Processor 1 o----o QCell 1 o---!/?
            |    +-------------+    +---------+    |
            |                                      |
            |    +-------------+    +---------+    |
           !/?---o Processor 2 o----o QCell 2 o---!/?
            |    +-------------+    +---------+    |
            |                                      |
            |    +-------------+    +---------+    |
           !/?---o Processor 3 o----o QCell 3 o---!/?
            |    +-------------+    +---------+    |
            |                                      |
            |    +-------------+    +---------+    |
           !/?---o Processor 4 o----o QCell 4 o---!/?
            |    +-------------+    +---------+    |
            |                                      |
            |    +-------------+    +---------+    |
           !/?---o Processor 5 o----o QCell 5 o---!/?
            |    +-------------+    +---------+    |
            .                                      .
            .                                      .
            .                                      .


If we let { --- } represent a process slot on a processor and { p x }
represent process x running on a processor the ROCK kernel in an
unassigned state can be depicted as:


    { --- }     { --- }     { --- }     { --- }     { --- }
    +-----+     +-----+     +-----+     +-----+     +-----+
    | P 1 |     | P 2 |     | P 3 |     | P 4 |     | P 5 |
IPC o-----o-----o-----o-----o-----o-----o-----o-----o-----o...
       |           |           |           |           |
    +-----+     +-----+     +-----+     +-----+     +-----+
    | Q 1 |     | Q 2 |     | Q 3 |     | Q 4 |     | Q 5 |
    +-----+     +-----+     +-----+     +-----+     +-----+
       |           |           |           |           |
IQC   !/?---------!/?---------!/?---------!/?---------!/?.....


The purpose of the QCells is to manage the allocation of processes to
processors or more generally the allocation of resources. This is
achieved by the QCells maintaining a shared token that manages an eager
queue. An eager queue is based on the idea that every processor in the
queue wants to find some work and will keep announcing its availability
until it gets some.

The protocol running on the IQC is an Eager Queue Protocol (EQP). This
is a protocol that is designed to:

    * Be responsive - work requests should be actioned quickly.
    * Minimise collisions.
    * Minimise the last broadcast time for all processors.
        - This is so processors are "known" to be alive.
    * Provide state information on the network.
        - This allows processors to be switched on and off as the work
          load varies.
        - More generally it forms the basis for a feedback control
          system.
    * Provide fault tolerance.

To illustrate the use of ROCK, the following diagram shows how it is
hoped to fit ROCK into a MINIX 3 like microkernel architecture.


          +------------------------------------------------------------+
Layer 4.  | User Process                        { p 4 }                |
          +------------------------------------------------------------+
Layer 3.  | Server Process                                  { p 5 }    |
          +------------------------------------------------------------+
Layer 2.  | Device Drivers          { p 3 }                            |
          +------------------------------------------------------------+
Layer 1.  | { p 1 }     { p 2 }                                        |
          +------------------------------------------------------------+
            +-----+     +-----+     +-----+     +-----+     +-----+
            | P 1 |     | P 2 |     | P 3 |     | P 4 |     | P 5 |
        IPC o-----o-----o-----o-----o-----o-----o-----o-----o-----o...
               |           |           |           |           |
            +-----+     +-----+     +-----+     +-----+     +-----+
            | Q 1 |     | Q 2 |     | Q 3 |     | Q 4 |     | Q 5 |
            +-----+     +-----+     +-----+     +-----+     +-----+
               |           |           |           |           |
        IQC   !/?---------!/?---------!/?---------!/?---------!/?.....


In this architecture the virtual processor architecture of traditional
posix operating systems is realised using "concrete-virtualisation".
That is each process runs on a concrete processor but the network is
virtualised the resource allocation of an eager queue protocol.

Looking at the design of the kernel it meets the minimal design
pioneered by Brinch Hansen's Nucleus (Hansen, 1970):

    1. some mechanism for dealing with address spaces — this is
       required for managing memory protection;
    2. some execution abstraction to manage CPU allocation
    3. inter-process communication

These are satisfied as follows:

    1. Each process would have its own address space as it is running
       on its own processor.
    2. The Eager Queue systems is a general resource allocation system
       and can allocate processors (CPUs) as well as other resources.
    3. Message passing would be a primitive in the system in a similar
       way to Erlang (http://www.erlang.org/).

The major difference to Nucleus is the creation of processes. Processes
in ROCK are created by finding a free processor.

Conceptually the system has similarities to Singularity (Fahndrich,
Aiken, et al., 2006 and Hunt, Larus, 2007). However rather than
"Rethinking the Software Stack" and implementing the isolation of
processes using complex software, ROCK "Flattens the Software Stack".
This gives the software developer a simple model akin to the programming
model of real time and embedded systems. This model hopefully avoids
much of the non-determinism of existing general purpose operating
systems.

Background Work
---------------

This section summarises the work that I have completed so far. It is
all publicly available and all code has been written under GPL v3.0.

In January 2006 I started making some notes on eager queues on
NovellForge, these are fairly brief and can be found at:

    http://developer.novell.com/wiki/index.php/OpenEd_-_Lab_4

In January 2007 I moved to SourceForge and explored writing a simulation
in CS, I made some progress with this but it was slow. I also started
some work on Broadcast Calculi which lead to "A Simple Broadcast
Calculus".

In February 2008 I moved to hosting the project in Launchpad. Since
hosting the project here I have written a python simulation of an eager
queue protocol and started work on an implementation using Erlang. So
far in Erlang I have written a queue protocol.

Motivation
----------

A general purpose operating system's main function is to share out the
CPU amongst processes. A process is a program running on a virtual
processor. The operating system implements the virtual processors for
the processes.

One of the major benefits of a general purpose operating system is that
its virtual processor architecture is not static but can vary
dynamically almost without limit. This is achieved by the ability for
processes to create and destroy other processes. This functionality is
also implemented by the operating system using functions such as fork,
spawn, kill, etc.

The operating system can be seen as having an unlimited pool of virtual
processors which processes can use to create dynamically varying
networks.

The versatility of a general purpose operating system often comes at the
cost of reliability and predictability largely stemming from the
non-deterministic programming model that is a result of its design. The
non-determinism arises from the multiprogramming model. The ordering of
process execution is controlled by the operating system.

A process is active only when it is in control of the program counter.
In the simplest systems there is only one program counter. The operating
system behaves as a largely unobservable and unpredictable adjudicator
in a competition for the program counter. Though each process will
control the program counter in a deterministic way the real time
properties of the program counter and hence the processes are lost. This
leads to non-deterministic program models such as threading.

The complexities of programming models such as threading and the fact
that in reality all processes are really running as one process in
one address space leads to much of the unreliability of general purpose
operating systems.

Processes do not run in isolation but require to communicate with each
other. As well as providing a pool of virtual processors an operating
system needs to provide a communication mechanism between processes.
This mechanism is called inter-process communication.

Operating systems have become more powerful largely through hardware
improvements that have permitted the continuation of the many processes
running as one process model. Attempts to use more physical processors
have often failed due to the lack of versatility in the resulting
architectures.

Is it possible to replace the pool of virtual processors with a pool of
real processors without losing the versatility of the single process
model? The implementation of such a system using a distributed resource
allocation mechanism is the central aim of the work reported here.

There are many ways of implementing the communication infrastructure.
For example to create a dynamic network we can use local wireless
communications. Different frequencies can be used by different process
groups. Inter-process communications would have their own
communication channels/frequencies.

The major problem is how to manage the pool of processors. This can be
implemented by doing away with the centralised scheduler and letting
each processor manage itself. The problem is now really one of resource
allocation rather than scheduling. So that the resource allocation is
robust there needs to be no single point of failure. To maintain
versatility the state of the resource allocation needs to be observable
between the processors. To achieve this a set of broadcast based
protocols is proposed called Eager Queue Protocols (EQP).

The following hints at an equivalent to a fork() in such a system.

       pid = iqc.Request()
       ipc.Send(pid, program)

EQP is then used in a new general purpose kernel called ROCK.

ROCK - Robust Optimising Concurrent Kernel
------------------------------------------

ROCK uses EQP to replace the scheduling and process creation part of
the kernel. If we consider MINIX 3 as an example of a microkernel it is
possible to see where ROCK fits into a general purpose architecture.

MINIX 3 is a clean implementation of a microkernel design and allows us
to concentrate on what is in the "core" of the operating system.


          +------------------------------------------------------------+
Layer 4.  | User Process                                               |
          +------------------------------------------------------------+
Layer 3.  | Server Process                                             |
          +------------------------------------------------------------+
Layer 2.  | Device Drivers                                             |
          +------------------------------------------------------------+
Layer 1.  | Kernel                          | System Task | Clock Task |
          +------------------------------------------------------------+


The layer we are looking at replacing is Layer 1, the Kernel. In Minix
the kernel implements the following:

       * Manage Interrupts and Traps
       * Saving and Restoring Registers
       * Providing Services to the Next Layer
       * Scheduling
       * Messaging and Communication services

In ROCK we are now running on multiple processors. Interrupts are
handled by dedicated processors, there will be less need for Traps.
Saving and Restoring registers will be required much less as each
process will now be running on its own processor. Services to
the next layer will need further consideration. Scheduling will be
handled by EQP. Message and Communication services will be handled by
each processor directly.

The architecture now looks as follows:


          +------------------------------------------------------------+
Layer 4.  | User Process                                               |
          +------------------------------------------------------------+
Layer 3.  | Server Process                                             |
          +------------------------------------------------------------+
Layer 2.  | Device Drivers                                             |
          +------------------------------------------------------------+
Layer 1.  | ROCK                                                       |
          +------------------------------------------------------------+


In the hadware we now have a network of Processor / QCell pairs (more
generally there can be many QCells for a given processor). The QCells
communicate over inter-qcell-channels whilst the processors communicate
over inter-processor-channels.


          +----------------------------------------------------------...
Layer 4.  | User Process
          +----------------------------------------------------------...
Layer 3.  | Server Process                                  { p 5 }
          +----------------------------------------------------------...
Layer 2.  | Device Drivers          { p 3 }
          +----------------------------------------------------------...
Layer 1.  | { p 1 }     { p 2 }                 { p 4 }
          +----------------------------------------------------------...
            +-----+     +-----+     +-----+     +-----+     +-----+
            | P 1 |     | P 2 |     | P 3 |     | P 4 |     | P 5 |
        IPC o-----o-----o-----o-----o-----o-----o-----o-----o-----o...
               |           |           |           |           |
            +-----+     +-----+     +-----+     +-----+     +-----+
            | Q 1 |<--->| Q 2 |     | Q 3 |     | Q 4 |     | Q 5 |
            +-----+     +-----+     +-----+     +-----+     +-----+
               |           |           |           |           |
        IQC   !/?---------!/?---------!/?---------!/?---------!/?.....


The QCells use an Eager Queue Protocol to communicate over the IQCs.

EQP (Eager Queue Protocols)
---------------------------

Eager Queue Protocols are simple protocols based around a small set of
messages. A message consists of A Sender, an Action and an EQ data
structure. The message is broadcast over an inter-qcell channel (IQC).
There can be many IQCs and these can be created dynamically. Some will
be short lived, while some IQCs will outlive the lives of individual
processors.

Taking iqc as an IQC, then a broadcast can be defined as follows (as
seen by an observer of the channel).

        iqc.{Sender, Action, EQ}

There are 6 Actions:

        * JOIN
        * UPDATE
        * EXIT
        * REQUEST
        * ACCEPT
        * DECLINE

At its simplest the EQ data structure is simply a list of processors
with the processors listed in the order of last broadcast with the most
recent at the head of the list. For example in:

        [p2, p3, p1, p5, p9]

p2 was the QCell that most recently broadcast. p9 was the lease recent
broadcaster. More complex EQ data structures can be defined.

The mechanics of an eager queue protocol are based on the following:

        * Multicast (broadcast) messaging
        * Heartbeat
        * Token

Multicasts/Broadcasts are used so that all processors linked into an
eager queue are constantly updated.

Update messages are sent regularly (possibly at 10KHz or more) using a
distributed heartbeat. This is so the system is responsive, that is
requests can be satisfied quickly and new processors can join quickly
and existing processors can exit quickly. In its quiet state observing
an eager queue channel will result in the following sequence of Actions:

        wait -> UPDATE -> wait -> UPDATE -> wait -> UPDATE -> . . .

The eager queue data structure forms a shared token that all QCells
have. This has some similarities to the use of tokens in (Hansen,
Jansen, et al., 2005). The token is used to establish who is next to
broadcast, when requests can be made, who will accept a request and also
to minimise collisions during requests and state changes. If we look at
the complete sequence of UPDATE messages we would see something like:

        * wait
        * {p2, UPDATE, [p2, p3, p1, p5, p9]}
        * wait
        * {p9, UPDATE, [p9, p2, p3, p1, p5]}
        * wait
        * {p5, UPDATE, [p5, p9, p2, p3, p1]}
        * wait
        * {p1, UPDATE, [p1, p5, p9, p2, p3]}
        * wait
        * {p3, UPDATE, [p3, p1, p5, p9, p2]}
        * wait
        * {p2, UPDATE, [p2, p3, p1, p5, p9]}
        * . . .

The UPDATE messages act as a synchronising pulse. All other messages
will occur after an UPDATE. This is key to reducing collisions.

Monitoring the IQC, message sequences will always be of the form:

    UPDATE -> MESSAGE* -> wait-> UPDATE -> MESSAGE* -> wait -> UPDATE...

Where MESSAGE := JOIN | EXIT | REQUEST | ACCEPT | DECLINE

The exact sequences of messages can be defined as a regular expression.
This is the characteristic signature of an Eager Queue Protocol:

    (->wait->UPDATE((->REQUEST->DECLINE*->ACCEPT)|->JOIN|->EXIT)*)*

An UPDATE will only ever occur after a quiet (wait) interval on the IQC.

If a processor wants to REQUEST another processor, it does so by
broadcasting a REQUEST after an UPDATE. The exact time it can broadcast
the REQUEST is determined by its position in the eager queue. EQP
dictates that a REQUEST must be followed by an ACCEPT or DECLINE. In
the current EQP implementation the processor that must accept is the
processor at the head of the queue, i.e. the most recent to broadcast
(the choice of processor can be changed, e.g. to the processor that has
been waiting longest etc.).

        * {UPDATE,       [p2, p3, p1, p5, p9]}
        * {REQUEST(p99), [p2, p3, p1, p5, p9]}
        * {ACCEPT,       [p3, p1, p5, p9]}

In the current implementation when an ACCEPT happens the accepting
processor is removed from the queue, this may also be changed so that
the state of processors is kept in the queue, also in the current
implementation the requesting processor does not have to be in the
queue.

Because the whole system has no hard handshakes processors need to be
able to decline a REQUEST, for example when a processor is shutting down
or just pulling out of the queue. If a processor declines then the next
processor in line has to ACCEPT or DECLINE, this continues until the
REQUEST is accepted or the system is exhausted. Exhaustion should be
rare as it is assumed that there is a large number of available
processors and processors switch on when the supply of available
processors is low.

        * {UPDATE,       [p2, p3, p1, p5, p9]}
        * {REQUEST(p99), [p2, p3, p1, p5, p9]}
        * {DECLINE,      [p3, p1, p5, p9]}
        * {DECLINE,      [p1, p5, p9]}
        * {ACCEPT,       [p5, p9]}

If we add requesting processors into the queue then collisions can be
further avoided by allocating timeslots when requests can be made. The
time slots can be as small as the signal detection time.

|<- UPDATE ->|<- r1 ->|<- r2 ->| . . . |<- rn ->|<- JOIN/EXIT ->|

If there are a large number of potential requesting processors rather
than have n time-slots, e.g.

|<- UPDATE ->|<- r1 ->|<- r2 ->| . . . |<- r16 ->|<- JOIN/EXIT ->|

The transmission slots can be arranged in a square grid:

|<- UPDATE ->|<- r1  ->|<- r2  ->|<- r3  ->|<- r4  ->|<- JOIN/EXIT ->|
             |<- r5  ->|<- r6  ->|<- r7  ->|<- r8  ->|
             |<- r9  ->|<- r10 ->|<- r11 ->|<- r12 ->|
             |<- r13 ->|<- r14 ->|<- r15 ->|<- r16 ->|

If there is a collision in a time slot the column is flattened to
resolve the collision:

|<- UPDATE ->|<- ! ->|<- r1  ->|<- r5  ->|<- r9  ->|<- r13 ->|<- r2  ->|
             |<- ! ->|                                       |<- r6  ->|
             |<- ! ->|                                       |<- r10 ->|
             |<- ! ->|                                       |<- r14 ->|

The time slots can be tuned statistically based on the history of the
channel.

The JOIN and EXIT actions allow processors to join and exit an eager
queue.

        * wait
        * {p2, UPDATE, [p2, p3, p1, p5, p9]}
        * {p1, EXIT,   [p2, p3, p5, p9]}
        * {p99, JOIN,  [p99, p2, p3, p5, p9]}
        * wait
        * {p9, UPDATE, [p9, p99, p2, p3, p5]}
        * ...

The above has just outlined how an Eager Queue Protocol works. More
detail can be found on the Launchpad EQP project site:

        * https://launchpad.net/eqp

REFERENCES
----------

(Fahndrich, Aiken, et al., 2006)

    "Language Support for Fast and Reliable Message-based Communication
    in Singularity OS"
    Fahndrich, M., Aiken, M., Chris Hawblitzel, Orion Hodson, Galen
    Hunt, James R. Larus and Steven Levi.
    EuroSys06, April 18-21, 2006.

(Hansen, 1970)

    "The nucleus of a multiprogramming system"
    Hansen, Per Brinch
    Communications of the ACM
    Volume 13 ,  Issue 4  (April 1970)
    Pages: 238 - 241
    Year of Publication: 1970
    ISSN:0001-0782

(Hansen, Jansen, et al., 2005)

    "RTnet: a distributed real-time protocol for broadcast-capable
    networks"
    Hanssen, F.; Jansen, P.G.; Scholten, H.; Mullender, S.
    Autonomic and Autonomous Systems and International Conference on
    Networking and Services, 2005.
    ICAS-ICNS 2005. Joint International Conference on
    Volume , Issue , 23-28 Oct. 2005 Page(s): 18 - 18
    Digital Object Identifier   10.1109/ICAS-ICNS.2005.85

(Hunt, Larus, 2007)

    "Singularity: Rethinking the Software Stack"
    Hunt, Galen C., Larus, James R.
    Microsoft Research Labs, 2007.

(Thomsen, 1989)

    "A calculus of higher order communicating systems"
    Thomsen, Bent
    Annual Symposium on Principles of Programming Languages
    Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles
    of programming languages
    Austin, Texas, United States
    Pages: 143 - 154
    Year of Publication: 1989
    ISBN:0-89791-294-2

Re: A niche for the Hurd

Reply via email to