Sergiu Ivanov wrote: > Is it possible to run only *parts* of the Hurd on separate computers, > that is, make the Hurd a kind of a network operating system. It most > probably sounds crazy, but I know that only a microkernel OS can do > something like that, so the Hurd may be appropriate :-) > > Sorry for being late with the idea :-( > > scolobb >
Apologies too for being late in posting this. Also apologies if this is the incorrect place for this. I'm not sure if this is the correct thread to post it to but I was thinking of posting it to threads which have been referred to in this thread. It also a bit "blue sky" (brainstorm). I've been working on a distributed scheduler for some time and after reading the papers by Walfield and Brinkman (see below) I've been wondering whether the work I've been doing could be used in the Hurd? What I've detailed so far in the work is more of a resource allocator than a scheduler. Below I've pasted the contents of an email I sent recently to a contact and I have written a a fairly detailed report. The report is at: /doc/reports/report01.txt http://bazaar.launchpad.net/~jason-cozens/eqp/trunk/files Best Wishes, Jason Cozens. ----------------------------------------------------------------------- N.H. Walfield, M. Brinkman "Improving Usability via Access Decomposition and Policy Refinement" 2007. http://www.walfield.org/papers /20070104-walfield-access-decomposition-policy-refinement.pdf N.H. Walfield, M. Brinkman "A Critique of the GNU Hurd Multi-server Operating System" 2007 http://www.walfield.org/papers /200707-walfield-critique-of-the-GNU-Hurd.pdf ------------------------------------------------------------------------ Reliable Computing - Main Objectives ------------------------------------ The major objective that I am working on is how to make general purpose operating systems more reliable. Reliability in this context is concerned with the ability to build component based systems whose emergent behaviours can be proved correct from formal models of the components. With reliability comes the ability to build larger systems more easily. I am investigating whether hardware isolation of processes can improve reliability. I have taken this hardware isolation to the limit and require that each "process" runs on its own isolated processor. By having each process run on its own processor the possibility for proving process behaviour correct becomes more tractable. It becomes easier to formally define the processor architecture and the environment in which a process runs. From this the formal proof of correctness of a process's behaviour may be possible. Added to this the architecture put forward below uses a very simple protocol for process creation scheduling and resource allocation whose behaviour including failures can be proved correct using "A Simple Broadcast Calculus" which I'm currently writing up. It's not enough to design a new architecture and expect it to be adopted. For the new architecture to find widespread use it must support current software and ideally current software development paradigms. The new architecture should implement a POSIX compliant API. The intention is that each fork()/exec*() process creation will result with a new process on its own processor. As time goes on the creation of processes within code can be optimised to the new kernel. To this end the architecture is based on microkernels such as MINIX 3. MINIX 3 has been used initially for clean and simple structure. The real target operating system is the GNU Hurd. Though reliability is a laudable goal in its own right, performance is also crucial. The design is based on that of microkernels and at present these are known to have poor performance compared to monolithic kernels such as linux due to issues such as a large number of context switches and issues with message passing. The new architecture would have no context switching and would have message passing as a primitive. Each process would have its own processor with its own address space communicating with other processes using concurrent messages. When a processor had nothing to do it would drop into a low power state waiting for an interrupt or other event to restart it. The objectives in design in order of importance are: 1. Reliability. 2. Ease of Software Migration. 3. Performance. Outline of Work so Far - Introduction to ROCK and EQP ----------------------------------------------------- This section is a high level overview of the work I've done so far. The main result of the work I have been doing so far is a kernel I call ROCK (Reliable Optimising Concurrent Kernel). The kernel is made up of a network of worker processors where each worker processor is tightly coupled to one or more QCells dedicated to the worker. A basic assumption of most OS design that processors have to be kept busy is questioned and instead it is assumed that there is an abundance of very simple processors (1000+) and communication is cheap. The worker processors (Processor 1, 2, etc.) communicate with each other using messages over the inter-processor communication (IPC) channels. This is fairly standard. The QCells (QCell 1, 2, etc.) communicate with each other using broadcast messages over the inter-qcell communication (IQC) channels. Each Processor / QCell coupling communicates using a private channel. The diagram below shows the case where each Processor has one QCell. IPC IQC Channels Channels +-------------+ +---------+ !/?---o Processor 1 o----o QCell 1 o---!/? | +-------------+ +---------+ | | | | +-------------+ +---------+ | !/?---o Processor 2 o----o QCell 2 o---!/? | +-------------+ +---------+ | | | | +-------------+ +---------+ | !/?---o Processor 3 o----o QCell 3 o---!/? | +-------------+ +---------+ | | | | +-------------+ +---------+ | !/?---o Processor 4 o----o QCell 4 o---!/? | +-------------+ +---------+ | | | | +-------------+ +---------+ | !/?---o Processor 5 o----o QCell 5 o---!/? | +-------------+ +---------+ | . . . . . . If we let { --- } represent a process slot on a processor and { p x } represent process x running on a processor the ROCK kernel in an unassigned state can be depicted as: { --- } { --- } { --- } { --- } { --- } +-----+ +-----+ +-----+ +-----+ +-----+ | P 1 | | P 2 | | P 3 | | P 4 | | P 5 | IPC o-----o-----o-----o-----o-----o-----o-----o-----o-----o... | | | | | +-----+ +-----+ +-----+ +-----+ +-----+ | Q 1 | | Q 2 | | Q 3 | | Q 4 | | Q 5 | +-----+ +-----+ +-----+ +-----+ +-----+ | | | | | IQC !/?---------!/?---------!/?---------!/?---------!/?..... The purpose of the QCells is to manage the allocation of processes to processors or more generally the allocation of resources. This is achieved by the QCells maintaining a shared token that manages an eager queue. An eager queue is based on the idea that every processor in the queue wants to find some work and will keep announcing its availability until it gets some. The protocol running on the IQC is an Eager Queue Protocol (EQP). This is a protocol that is designed to: * Be responsive - work requests should be actioned quickly. * Minimise collisions. * Minimise the last broadcast time for all processors. - This is so processors are "known" to be alive. * Provide state information on the network. - This allows processors to be switched on and off as the work load varies. - More generally it forms the basis for a feedback control system. * Provide fault tolerance. To illustrate the use of ROCK, the following diagram shows how it is hoped to fit ROCK into a MINIX 3 like microkernel architecture. +------------------------------------------------------------+ Layer 4. | User Process { p 4 } | +------------------------------------------------------------+ Layer 3. | Server Process { p 5 } | +------------------------------------------------------------+ Layer 2. | Device Drivers { p 3 } | +------------------------------------------------------------+ Layer 1. | { p 1 } { p 2 } | +------------------------------------------------------------+ +-----+ +-----+ +-----+ +-----+ +-----+ | P 1 | | P 2 | | P 3 | | P 4 | | P 5 | IPC o-----o-----o-----o-----o-----o-----o-----o-----o-----o... | | | | | +-----+ +-----+ +-----+ +-----+ +-----+ | Q 1 | | Q 2 | | Q 3 | | Q 4 | | Q 5 | +-----+ +-----+ +-----+ +-----+ +-----+ | | | | | IQC !/?---------!/?---------!/?---------!/?---------!/?..... In this architecture the virtual processor architecture of traditional posix operating systems is realised using "concrete-virtualisation". That is each process runs on a concrete processor but the network is virtualised the resource allocation of an eager queue protocol. Looking at the design of the kernel it meets the minimal design pioneered by Brinch Hansen's Nucleus (Hansen, 1970): 1. some mechanism for dealing with address spaces — this is required for managing memory protection; 2. some execution abstraction to manage CPU allocation 3. inter-process communication These are satisfied as follows: 1. Each process would have its own address space as it is running on its own processor. 2. The Eager Queue systems is a general resource allocation system and can allocate processors (CPUs) as well as other resources. 3. Message passing would be a primitive in the system in a similar way to Erlang (http://www.erlang.org/). The major difference to Nucleus is the creation of processes. Processes in ROCK are created by finding a free processor. Conceptually the system has similarities to Singularity (Fahndrich, Aiken, et al., 2006 and Hunt, Larus, 2007). However rather than "Rethinking the Software Stack" and implementing the isolation of processes using complex software, ROCK "Flattens the Software Stack". This gives the software developer a simple model akin to the programming model of real time and embedded systems. This model hopefully avoids much of the non-determinism of existing general purpose operating systems. Background Work --------------- This section summarises the work that I have completed so far. It is all publicly available and all code has been written under GPL v3.0. In January 2006 I started making some notes on eager queues on NovellForge, these are fairly brief and can be found at: http://developer.novell.com/wiki/index.php/OpenEd_-_Lab_4 In January 2007 I moved to SourceForge and explored writing a simulation in CS, I made some progress with this but it was slow. I also started some work on Broadcast Calculi which lead to "A Simple Broadcast Calculus". In February 2008 I moved to hosting the project in Launchpad. Since hosting the project here I have written a python simulation of an eager queue protocol and started work on an implementation using Erlang. So far in Erlang I have written a queue protocol. Motivation ---------- A general purpose operating system's main function is to share out the CPU amongst processes. A process is a program running on a virtual processor. The operating system implements the virtual processors for the processes. One of the major benefits of a general purpose operating system is that its virtual processor architecture is not static but can vary dynamically almost without limit. This is achieved by the ability for processes to create and destroy other processes. This functionality is also implemented by the operating system using functions such as fork, spawn, kill, etc. The operating system can be seen as having an unlimited pool of virtual processors which processes can use to create dynamically varying networks. The versatility of a general purpose operating system often comes at the cost of reliability and predictability largely stemming from the non-deterministic programming model that is a result of its design. The non-determinism arises from the multiprogramming model. The ordering of process execution is controlled by the operating system. A process is active only when it is in control of the program counter. In the simplest systems there is only one program counter. The operating system behaves as a largely unobservable and unpredictable adjudicator in a competition for the program counter. Though each process will control the program counter in a deterministic way the real time properties of the program counter and hence the processes are lost. This leads to non-deterministic program models such as threading. The complexities of programming models such as threading and the fact that in reality all processes are really running as one process in one address space leads to much of the unreliability of general purpose operating systems. Processes do not run in isolation but require to communicate with each other. As well as providing a pool of virtual processors an operating system needs to provide a communication mechanism between processes. This mechanism is called inter-process communication. Operating systems have become more powerful largely through hardware improvements that have permitted the continuation of the many processes running as one process model. Attempts to use more physical processors have often failed due to the lack of versatility in the resulting architectures. Is it possible to replace the pool of virtual processors with a pool of real processors without losing the versatility of the single process model? The implementation of such a system using a distributed resource allocation mechanism is the central aim of the work reported here. There are many ways of implementing the communication infrastructure. For example to create a dynamic network we can use local wireless communications. Different frequencies can be used by different process groups. Inter-process communications would have their own communication channels/frequencies. The major problem is how to manage the pool of processors. This can be implemented by doing away with the centralised scheduler and letting each processor manage itself. The problem is now really one of resource allocation rather than scheduling. So that the resource allocation is robust there needs to be no single point of failure. To maintain versatility the state of the resource allocation needs to be observable between the processors. To achieve this a set of broadcast based protocols is proposed called Eager Queue Protocols (EQP). The following hints at an equivalent to a fork() in such a system. pid = iqc.Request() ipc.Send(pid, program) EQP is then used in a new general purpose kernel called ROCK. ROCK - Robust Optimising Concurrent Kernel ------------------------------------------ ROCK uses EQP to replace the scheduling and process creation part of the kernel. If we consider MINIX 3 as an example of a microkernel it is possible to see where ROCK fits into a general purpose architecture. MINIX 3 is a clean implementation of a microkernel design and allows us to concentrate on what is in the "core" of the operating system. +------------------------------------------------------------+ Layer 4. | User Process | +------------------------------------------------------------+ Layer 3. | Server Process | +------------------------------------------------------------+ Layer 2. | Device Drivers | +------------------------------------------------------------+ Layer 1. | Kernel | System Task | Clock Task | +------------------------------------------------------------+ The layer we are looking at replacing is Layer 1, the Kernel. In Minix the kernel implements the following: * Manage Interrupts and Traps * Saving and Restoring Registers * Providing Services to the Next Layer * Scheduling * Messaging and Communication services In ROCK we are now running on multiple processors. Interrupts are handled by dedicated processors, there will be less need for Traps. Saving and Restoring registers will be required much less as each process will now be running on its own processor. Services to the next layer will need further consideration. Scheduling will be handled by EQP. Message and Communication services will be handled by each processor directly. The architecture now looks as follows: +------------------------------------------------------------+ Layer 4. | User Process | +------------------------------------------------------------+ Layer 3. | Server Process | +------------------------------------------------------------+ Layer 2. | Device Drivers | +------------------------------------------------------------+ Layer 1. | ROCK | +------------------------------------------------------------+ In the hadware we now have a network of Processor / QCell pairs (more generally there can be many QCells for a given processor). The QCells communicate over inter-qcell-channels whilst the processors communicate over inter-processor-channels. +----------------------------------------------------------... Layer 4. | User Process +----------------------------------------------------------... Layer 3. | Server Process { p 5 } +----------------------------------------------------------... Layer 2. | Device Drivers { p 3 } +----------------------------------------------------------... Layer 1. | { p 1 } { p 2 } { p 4 } +----------------------------------------------------------... +-----+ +-----+ +-----+ +-----+ +-----+ | P 1 | | P 2 | | P 3 | | P 4 | | P 5 | IPC o-----o-----o-----o-----o-----o-----o-----o-----o-----o... | | | | | +-----+ +-----+ +-----+ +-----+ +-----+ | Q 1 |<--->| Q 2 | | Q 3 | | Q 4 | | Q 5 | +-----+ +-----+ +-----+ +-----+ +-----+ | | | | | IQC !/?---------!/?---------!/?---------!/?---------!/?..... The QCells use an Eager Queue Protocol to communicate over the IQCs. EQP (Eager Queue Protocols) --------------------------- Eager Queue Protocols are simple protocols based around a small set of messages. A message consists of A Sender, an Action and an EQ data structure. The message is broadcast over an inter-qcell channel (IQC). There can be many IQCs and these can be created dynamically. Some will be short lived, while some IQCs will outlive the lives of individual processors. Taking iqc as an IQC, then a broadcast can be defined as follows (as seen by an observer of the channel). iqc.{Sender, Action, EQ} There are 6 Actions: * JOIN * UPDATE * EXIT * REQUEST * ACCEPT * DECLINE At its simplest the EQ data structure is simply a list of processors with the processors listed in the order of last broadcast with the most recent at the head of the list. For example in: [p2, p3, p1, p5, p9] p2 was the QCell that most recently broadcast. p9 was the lease recent broadcaster. More complex EQ data structures can be defined. The mechanics of an eager queue protocol are based on the following: * Multicast (broadcast) messaging * Heartbeat * Token Multicasts/Broadcasts are used so that all processors linked into an eager queue are constantly updated. Update messages are sent regularly (possibly at 10KHz or more) using a distributed heartbeat. This is so the system is responsive, that is requests can be satisfied quickly and new processors can join quickly and existing processors can exit quickly. In its quiet state observing an eager queue channel will result in the following sequence of Actions: wait -> UPDATE -> wait -> UPDATE -> wait -> UPDATE -> . . . The eager queue data structure forms a shared token that all QCells have. This has some similarities to the use of tokens in (Hansen, Jansen, et al., 2005). The token is used to establish who is next to broadcast, when requests can be made, who will accept a request and also to minimise collisions during requests and state changes. If we look at the complete sequence of UPDATE messages we would see something like: * wait * {p2, UPDATE, [p2, p3, p1, p5, p9]} * wait * {p9, UPDATE, [p9, p2, p3, p1, p5]} * wait * {p5, UPDATE, [p5, p9, p2, p3, p1]} * wait * {p1, UPDATE, [p1, p5, p9, p2, p3]} * wait * {p3, UPDATE, [p3, p1, p5, p9, p2]} * wait * {p2, UPDATE, [p2, p3, p1, p5, p9]} * . . . The UPDATE messages act as a synchronising pulse. All other messages will occur after an UPDATE. This is key to reducing collisions. Monitoring the IQC, message sequences will always be of the form: UPDATE -> MESSAGE* -> wait-> UPDATE -> MESSAGE* -> wait -> UPDATE... Where MESSAGE := JOIN | EXIT | REQUEST | ACCEPT | DECLINE The exact sequences of messages can be defined as a regular expression. This is the characteristic signature of an Eager Queue Protocol: (->wait->UPDATE((->REQUEST->DECLINE*->ACCEPT)|->JOIN|->EXIT)*)* An UPDATE will only ever occur after a quiet (wait) interval on the IQC. If a processor wants to REQUEST another processor, it does so by broadcasting a REQUEST after an UPDATE. The exact time it can broadcast the REQUEST is determined by its position in the eager queue. EQP dictates that a REQUEST must be followed by an ACCEPT or DECLINE. In the current EQP implementation the processor that must accept is the processor at the head of the queue, i.e. the most recent to broadcast (the choice of processor can be changed, e.g. to the processor that has been waiting longest etc.). * {UPDATE, [p2, p3, p1, p5, p9]} * {REQUEST(p99), [p2, p3, p1, p5, p9]} * {ACCEPT, [p3, p1, p5, p9]} In the current implementation when an ACCEPT happens the accepting processor is removed from the queue, this may also be changed so that the state of processors is kept in the queue, also in the current implementation the requesting processor does not have to be in the queue. Because the whole system has no hard handshakes processors need to be able to decline a REQUEST, for example when a processor is shutting down or just pulling out of the queue. If a processor declines then the next processor in line has to ACCEPT or DECLINE, this continues until the REQUEST is accepted or the system is exhausted. Exhaustion should be rare as it is assumed that there is a large number of available processors and processors switch on when the supply of available processors is low. * {UPDATE, [p2, p3, p1, p5, p9]} * {REQUEST(p99), [p2, p3, p1, p5, p9]} * {DECLINE, [p3, p1, p5, p9]} * {DECLINE, [p1, p5, p9]} * {ACCEPT, [p5, p9]} If we add requesting processors into the queue then collisions can be further avoided by allocating timeslots when requests can be made. The time slots can be as small as the signal detection time. |<- UPDATE ->|<- r1 ->|<- r2 ->| . . . |<- rn ->|<- JOIN/EXIT ->| If there are a large number of potential requesting processors rather than have n time-slots, e.g. |<- UPDATE ->|<- r1 ->|<- r2 ->| . . . |<- r16 ->|<- JOIN/EXIT ->| The transmission slots can be arranged in a square grid: |<- UPDATE ->|<- r1 ->|<- r2 ->|<- r3 ->|<- r4 ->|<- JOIN/EXIT ->| |<- r5 ->|<- r6 ->|<- r7 ->|<- r8 ->| |<- r9 ->|<- r10 ->|<- r11 ->|<- r12 ->| |<- r13 ->|<- r14 ->|<- r15 ->|<- r16 ->| If there is a collision in a time slot the column is flattened to resolve the collision: |<- UPDATE ->|<- ! ->|<- r1 ->|<- r5 ->|<- r9 ->|<- r13 ->|<- r2 ->| |<- ! ->| |<- r6 ->| |<- ! ->| |<- r10 ->| |<- ! ->| |<- r14 ->| The time slots can be tuned statistically based on the history of the channel. The JOIN and EXIT actions allow processors to join and exit an eager queue. * wait * {p2, UPDATE, [p2, p3, p1, p5, p9]} * {p1, EXIT, [p2, p3, p5, p9]} * {p99, JOIN, [p99, p2, p3, p5, p9]} * wait * {p9, UPDATE, [p9, p99, p2, p3, p5]} * ... The above has just outlined how an Eager Queue Protocol works. More detail can be found on the Launchpad EQP project site: * https://launchpad.net/eqp REFERENCES ---------- (Fahndrich, Aiken, et al., 2006) "Language Support for Fast and Reliable Message-based Communication in Singularity OS" Fahndrich, M., Aiken, M., Chris Hawblitzel, Orion Hodson, Galen Hunt, James R. Larus and Steven Levi. EuroSys06, April 18-21, 2006. (Hansen, 1970) "The nucleus of a multiprogramming system" Hansen, Per Brinch Communications of the ACM Volume 13 , Issue 4 (April 1970) Pages: 238 - 241 Year of Publication: 1970 ISSN:0001-0782 (Hansen, Jansen, et al., 2005) "RTnet: a distributed real-time protocol for broadcast-capable networks" Hanssen, F.; Jansen, P.G.; Scholten, H.; Mullender, S. Autonomic and Autonomous Systems and International Conference on Networking and Services, 2005. ICAS-ICNS 2005. Joint International Conference on Volume , Issue , 23-28 Oct. 2005 Page(s): 18 - 18 Digital Object Identifier 10.1109/ICAS-ICNS.2005.85 (Hunt, Larus, 2007) "Singularity: Rethinking the Software Stack" Hunt, Galen C., Larus, James R. Microsoft Research Labs, 2007. (Thomsen, 1989) "A calculus of higher order communicating systems" Thomsen, Bent Annual Symposium on Principles of Programming Languages Proceedings of the 16th ACM SIGPLAN-SIGACT symposium on Principles of programming languages Austin, Texas, United States Pages: 143 - 154 Year of Publication: 1989 ISBN:0-89791-294-2