On 10 Jun 2013, at 15:40, Florent Peterschmitt <flor...@peterschmitt.fr> wrote:

> Ok and isn't it a "bad" thing ? I mean, even if the video driver
> crashes, I still want to have the ability to reboot the right way,
> avoiding corrupted files and WIP lose.
> 
> Another thing is a non-critical module that can crash, but because not
> used by all apps on the machine, letting them ones that can continue run.
> 
> But I don't know what is the approach of FreeBSD and devs about that.

Yes, it's a bad thing.  If we had privilege domain crossing that was as cheap 
as a function call (or, at least, almost as cheap) then we could implement 
fine-grained separation within the kernel and not incur any performance 
penalty.  Unfortunately, this is not possible without some fairly significant 
changes to current CPU instruction sets (which, actually, several of us in 
FreeBSD land are working on, but that's unlikely to be seen in any mainstream 
processor for at least 5-10 years).  

In the current world, we have a fairly poor selection of options for isolation. 
On i386, we had 4 protection rings, but on the 486 and newer, transitions to 
and from rings 1 and 2 became increasingly expensive because most operating 
systems only used rings 0 and 3 (Netware and OS/2 are the two exceptions that 
I know of).  On other architectures we just have privileged and unprivileged 
modes.  Code in privileged mode can't be isolated from other code in privileged 
mode, and code in unprivileged mode incurs some overhead for calls into 
privileged mode.

There are some tricks that you can do to enforce some weaker protection.  For 
example, on 64-bit platforms every driver could be written to use 32-bit 
pointers, be given a 4GB segment of privileged-mode virtual memory, and be 
required to go through special gates to do anything with the rest of the 
kernel's address space.  You'd then end up with a lot more TLB churn, but gain 
protection against a number of kinds of pointer error (protection faults inside 
the 32-bit window would just result in that module being killed and restarted).
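
To make the 32-bit window idea concrete, here is a minimal userspace sketch in 
C.  All of the names (drv_window, drv_ptr, drv_gate, drv_write) are 
hypothetical and do not correspond to any FreeBSD interface.  The sketch only 
models the bounds check that a gate would perform; a real implementation would 
rely on the MMU and would kill the module on a fault instead of returning an 
error.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model: a driver's private window, addressed by 32-bit offsets. */
struct drv_window {
    uint8_t *base;   /* start of the window in the full address space */
    uint64_t size;   /* bytes in the window, at most 4GB */
};

/* The driver's "pointer" is just an offset into its own window. */
typedef uint32_t drv_ptr;

/*
 * Gate: translate a driver pointer into a real pointer, or return NULL if
 * the access [p, p + len) would leave the window.  Returning NULL stands in
 * for the protection fault described in the text.
 */
void *
drv_gate(const struct drv_window *w, drv_ptr p, uint32_t len)
{
    assert(w->size <= (UINT64_C(1) << 32));
    /* Widen before adding so that p + len cannot wrap around. */
    if ((uint64_t)p + len > w->size)
        return NULL;
    return w->base + p;
}

/* Bounds-checked copy into the window.  Returns 0 on success, -1 on a fault. */
int
drv_write(const struct drv_window *w, drv_ptr dst, const void *src, uint32_t len)
{
    void *d = drv_gate(w, dst, len);

    if (d == NULL)
        return -1;
    memcpy(d, src, len);
    return 0;
}
```

A stray offset such as 0xFFFFFFFF can, at worst, be rejected by the gate; it 
can never name memory outside the window, which is the class of pointer error 
this trick protects against.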
 

Unfortunately, there are several problems with this.  The most obvious is that 
killing a module is not always trivial.  For example, a module may hold various 
locks, but it's not always clear which module owns a lock.  Locks are held by 
kernel threads, but a thread can have a call stack spanning several modules.  
Working out exactly which driver holds the lock is not always trivial, and 
there is also the question of what you do about a thread that contains some 
call frames belonging to the module that you've just killed.  You'd need to 
provide some exception-like mechanism for handling this case (and unwinding the 
stack in the case where it is potentially corrupt is also nontrivial).  
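
The bookkeeping that this would need can be sketched as follows.  This is a 
toy model in C, not how FreeBSD locks work: the types and functions 
(thread_model, lock_model, thread_enter and so on) are hypothetical, and the 
model assumes that every call into a module is recorded, which is exactly the 
overhead that a real kernel would want to avoid.

```c
#include <stddef.h>

#define MAX_FRAMES 16
#define NO_MODULE  (-1)

/* Model of a kernel thread: one module id per call frame, outermost first. */
struct thread_model {
    int frames[MAX_FRAMES];
    int depth;
};

/* Model of a lock that also remembers which module's frame acquired it. */
struct lock_model {
    const struct thread_model *holder;   /* NULL when unlocked */
    int acquired_in;                     /* module id, or NO_MODULE */
};

/* Record a call into `module`.  Returns 0, or -1 if the model's stack is full. */
int
thread_enter(struct thread_model *t, int module)
{
    if (t->depth == MAX_FRAMES)
        return -1;
    t->frames[t->depth++] = module;
    return 0;
}

/* Record a return from the innermost frame. */
void
thread_leave(struct thread_model *t)
{
    if (t->depth > 0)
        t->depth--;
}

/* Take the lock and attribute it to the module of the innermost frame. */
void
lock_acquire(struct lock_model *l, const struct thread_model *t)
{
    l->holder = t;
    l->acquired_in = t->depth > 0 ? t->frames[t->depth - 1] : NO_MODULE;
}

/* Was this lock taken by a frame belonging to `module`? */
int
lock_owned_by(const struct lock_model *l, int module)
{
    return l->holder != NULL && l->acquired_in == module;
}

/* Does this thread still have call frames inside `module`? */
int
thread_has_frames_in(const struct thread_model *t, int module)
{
    for (int i = 0; i < t->depth; i++)
        if (t->frames[i] == module)
            return 1;
    return 0;
}
```

Even in this toy version, killing module 2 means finding every lock for which 
lock_owned_by() is true and every thread for which thread_has_frames_in() is 
true, and then unwinding those threads safely, which is the hard part.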

An alternative is to run the driver entirely, or mostly, in userspace.  The 
'mostly' option is often better.  For example, certain categories of USB 
devices are exposed by the FreeBSD kernel as USB generic devices (ugen driver) 
and some userspace component sends USB commands to it.  This involves some 
extra copying, but means that most of the (potentially buggy) driver logic is 
in the application.  If it crashes, you lose the application state (which, in a 
desktop setting, is only slightly better than crashing the kernel), but not the 
whole kernel.  
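
The crash-containment property is easy to demonstrate with plain POSIX 
process isolation.  The sketch below does not use ugen or any USB API; 
driver_main() is a hypothetical stand-in for the driver logic, and abort() 
stands in for whatever bug brings it down.  The point is only that the 
supervising process survives and can restart the driver.

```c
#include <signal.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical driver logic: returns normally when healthy, crashes if buggy. */
void
driver_main(int buggy)
{
    if (buggy)
        abort();   /* stand-in for a wild pointer or failed assertion */
}

/*
 * Run the driver logic in a child process.  Returns 0 on a clean exit, the
 * terminating signal number if the child crashed, or -1 on any other failure.
 */
int
run_driver_once(int buggy)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        driver_main(buggy);
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFSIGNALED(status))
        return WTERMSIG(status);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}

/*
 * Restart the driver until it exits cleanly.  `fail_first` makes the first N
 * attempts crash so that the restart path is exercised.  Returns the number
 * of restarts needed, or -1 if `max_restarts` was exceeded.
 */
int
supervise(int fail_first, int max_restarts)
{
    for (int restarts = 0; restarts <= max_restarts; restarts++) {
        if (run_driver_once(restarts < fail_first) == 0)
            return restarts;
    }
    return -1;
}
```

Note that each restart begins from scratch: whatever state the crashed child 
held is gone, which is the "you lose the application state" cost mentioned 
above.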

In the case of certain modern network interfaces (Infiniband in particular) and 
modern GPUs, the kernel handles even less.  The device has some hardware 
support for multiplexing and isolation and so all that the kernel has to do is 
set up some memory that both the device and the userspace code can access - 
including the device registers for controlling a command queue - and then 
delegate most of the operation to the userspace code.  This requires an IOMMU 
to actually provide isolation; otherwise an errant DMA request can still result 
in accessing or modifying kernel memory.
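
As a rough picture of what such a shared command queue looks like, here is a 
single-threaded model in C.  The layout and names (cmd, cmd_ring, ring_submit, 
ring_consume) are hypothetical and not taken from any real device.  Real 
hardware rings also need memory barriers and a doorbell register write, both 
of which this model leaves out.

```c
#include <stdint.h>

#define RING_SLOTS 8   /* power of two, so indices can wrap with a mask */

/* Hypothetical command format. */
struct cmd {
    uint32_t opcode;
    uint32_t arg;
};

/*
 * Memory that the kernel would map into both the device and the process.
 * head and tail are free-running counters; only their low bits index slots.
 */
struct cmd_ring {
    struct cmd slots[RING_SLOTS];
    uint32_t head;   /* next slot the device will consume */
    uint32_t tail;   /* next slot userspace will fill */
};

/* Userspace side: queue a command.  Returns 0, or -1 if the ring is full. */
int
ring_submit(struct cmd_ring *r, struct cmd c)
{
    if (r->tail - r->head == RING_SLOTS)
        return -1;
    r->slots[r->tail & (RING_SLOTS - 1)] = c;
    r->tail++;   /* a real driver would ring the doorbell here */
    return 0;
}

/* Device side: take the oldest command.  Returns 0, or -1 if the ring is empty. */
int
ring_consume(struct cmd_ring *r, struct cmd *out)
{
    if (r->head == r->tail)
        return -1;
    *out = r->slots[r->head & (RING_SLOTS - 1)];
    r->head++;
    return 0;
}
```

Once the mapping exists, neither function needs the kernel at all, which is 
where the performance comes from.  It is also why the IOMMU matters: nothing 
in the ring itself stops a command from naming an address that the process 
should not be able to reach.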

Even with this kind of isolation, there are still potential problems.  Many 
devices react poorly to bad input and can be left in a state that is hard to 
recover from, even if the driver itself is easy to restart.  A lot of OS 
instability (I saw a number as high as 20% of OS crashes quoted at MSR 
recently) is caused by drivers reacting poorly to intermittent hardware errors. 
Just restarting the driver (an approach that they tried) solved some, but not 
all, of these cases.

Of course, there are a lot of things in the kernel that are not drivers.  For 
example, FUSE allows us to run filesystems in userspace instead of in the 
kernel.  This comes with a performance penalty as a result of having to copy 
data from the kernel's buffer cache into the filesystem process, then back into 
the kernel, and then into the destination process (for a read - the same 
sequence in the opposite order on write).  Similarly, we have CUSE for 
character devices, which is used by a lot of webcam drivers.  These are a 
relatively good use-case for userspace drivers, because they are typically a 
streaming interface (data flows one way, from the device, and there isn't a lot 
of need for latency-sensitive round trips from the app to the driver) and the 
latency that users care about is on the order of 1/24th of a second, which is a 
very long time on a modern computer.  There are other examples, such as Netmap 
for pushing network packets directly into userspace, which can be combined with 
something like Ilias Marinos' userspace network stack to run the entire TCP/IP 
stack in userspace.
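
To put the 1/24th of a second figure for webcams in perspective, the 
arithmetic can be written out.  The 3GHz clock rate used in the note below is 
an assumption picked for illustration, not a number from this thread, and the 
helper names are hypothetical.

```c
#include <stdint.h>

/* Time available per frame, in nanoseconds (integer division, rounds down). */
uint64_t
frame_budget_ns(uint64_t fps)
{
    return UINT64_C(1000000000) / fps;
}

/* CPU cycles available per frame at a given clock rate. */
uint64_t
cycles_per_frame(uint64_t clock_hz, uint64_t fps)
{
    return clock_hz / fps;
}
```

At 24 frames per second the budget is about 41.7 milliseconds per frame, or 
125 million cycles on the assumed 3GHz core, so the extra copies and context 
switches of a userspace driver fit comfortably within it.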

Moving drivers into userspace is not a panacea.  It adds more asynchronous 
behaviour, which makes reasoning about the code harder and makes deadlocks far 
easier to introduce (for example, any userspace process has a lot of implicit 
interactions with the VM subsystem, which are more explicit in the kernel, and 
doesn't have a shared global namespace for locks).  Most of the code in the 
kernel is there because, when the code was written, it was the most sensible 
place for it.  In most cases, that is still true, although as CPU and software 
architectures evolve that may change.

David
