> > > What is this kernel lock everybody talks about? I mean, what is
> > > locked? Some actions must be done, and devs call lock before, and
> > > after it is done they call unlock?
> > > What is the kernel lock doing exactly, does it prevent other
> > > procedures from running?

> I was on Wikipedia, I did my googling. The material is spread all
> over and not very much of it is related to OpenBSD.
> But it is a large topic, I think, and I am not able to understand it
> all.

One example would be when a network packet comes in. At some point it
will end up in an inbound queue from the network driver. If you have
20 cores in the machine, 0, 1, or all 20 cores could potentially
decide to immediately pick that packet out of the queue and act on it
(receive it, filter it, route it, whatever), decreasing the number of
packets in the queue by 1 as they work on it. Now if 20 cores do this
in parallel without any checks, you could have the counter decreased
by 20 and have 20 copies of the packet being handled. So you need some
kind of semaphore, taken by the first core to react, to tell the
others to keep off the input queue for a moment.
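
To make the race concrete, here is a tiny userland C sketch of the
same idea using pthreads instead of kernel primitives (so none of this
is actual OpenBSD code, the names are made up): the lock turns "look
at the queue, take the head, decrease the counter" into one
indivisible step.

#include <pthread.h>
#include <stddef.h>

struct packet {
    struct packet *next;
    /* ... payload ... */
};

struct pktq {
    pthread_mutex_t  lock;      /* protects head and npackets     */
    struct packet   *head;      /* singly linked list of packets  */
    int              npackets;  /* how many packets are queued    */
};

/*
 * Take one packet off the queue, or return NULL if it is empty.
 * Without the lock, two cores could both see the same head, both
 * take it, and both decrement npackets: one packet handled twice
 * and a counter that no longer matches reality.
 */
struct packet *
pktq_dequeue(struct pktq *q)
{
    struct packet *p;

    pthread_mutex_lock(&q->lock);
    p = q->head;
    if (p != NULL) {
        q->head = p->next;
        q->npackets--;
    }
    pthread_mutex_unlock(&q->lock);
    return p;
}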

When starting out with SMP development, you make this as easy as
possible by having one single lock, the Giant Lock. This makes the
above problem go away in some sense, but it also means your kernel
cannot process any more packets than a single-core cpu without the
locking could. If this was a router, a very simplified chain of
operations would be "network driver receives packet", "packet input
processing", "routing decision is made", "output packet processing"
and lastly "network driver sends packet out". Five easy steps.

In the Giant Lock scenario, it starts with the network driver causing
an interrupt; the kernel acts on it, sets the Giant Lock to closed,
more or less runs all five steps, and exits back to running userspace
things again, releasing the lock so some other core can grab it as
needed later. If many cores often wait for this lock, they will not be
doing a lot of useful work, and it can lead to a worse situation than
if you had only one core.
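
In very rough pseudo-C it looks something like the sketch below. Only
KERNEL_LOCK()/KERNEL_UNLOCK() resemble the real OpenBSD macros; every
other name is invented for illustration and this is not how the actual
code is laid out.

/* Illustrative sketch only, not actual OpenBSD code. */
void
net_interrupt(void)
{
    struct packet *p;

    KERNEL_LOCK();                   /* one lock for everything     */
    while ((p = driver_receive()) != NULL) {  /* 1. driver receives */
        input_process(p);            /* 2. packet input processing  */
        route_decide(p);             /* 3. routing decision         */
        output_process(p);           /* 4. output packet processing */
        driver_send(p);              /* 5. driver sends it out      */
    }
    KERNEL_UNLOCK();                 /* only now can another core do
                                        any locked kernel work      */
}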

As the MP improvements are worked on, instead of locking the whole
kernel, certain subsystems (like network, disk, memory, audio) will
get smaller locks of their own, "fine-grained" locks. This way, the
web traffic your www server handles can cause disk IO in parallel with
packet reception if they end up on separate cores. For some systems
that is an improvement; for routers, not so much.
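
The difference, in sketch form (invented names again, loosely modeled
on the kernel's mutex(9) interface): each subsystem guards only its
own data, so a core busy with disk IO no longer blocks a core
receiving packets.

/* Sketch only: per-subsystem locks instead of one Giant Lock. */
struct mutex net_mtx;     /* protects network queues and state     */
struct mutex disk_mtx;    /* protects buffer cache and disk queues */

void
net_work(struct packet *p)
{
    mtx_enter(&net_mtx);      /* blocks only other *network* work  */
    handle_packet(p);
    mtx_leave(&net_mtx);
}

void
disk_work(struct buf *b)
{
    mtx_enter(&disk_mtx);     /* can run in parallel with net_work()
                                 on another core                    */
    start_io(b);
    mtx_leave(&disk_mtx);
}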

As you move these subsystem locks further and further down, more and
more can and will start happening in parallel. Perhaps handling
ethernet IRQs for packet input/output can be separated from ip_input()
and the routing decision engine? If you manage to separate all five
steps of my simplified network routing described above, you can now
get 5 cores working on shuffling packets through the machine, each
doing one of the parts. Lots better, though it might still not
optimally utilize all 20 cores you have. Still, an improvement, as
long as the locking dance is not too expensive compared to the work
you do while holding the lock.
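
One way to picture that fully split-up case is a pipeline: give each
of the five steps its own small lock and its own queue of packets
waiting for it, so five cores can each be inside a different stage at
the same time. A hand-wavy sketch with invented names:

/* Sketch only: one lock and one queue per stage of the pipeline. */
struct stage {
    struct mutex    mtx;      /* protects q                        */
    struct pktqueue q;        /* packets waiting for this stage    */
};

struct stage rx, input, routing, output, tx;    /* the five steps  */

void
stage_run(struct stage *from, struct stage *to,
    void (*work)(struct packet *))
{
    struct packet *p;

    mtx_enter(&from->mtx);
    p = pktq_get(&from->q);   /* grab one packet for this stage    */
    mtx_leave(&from->mtx);
    if (p == NULL)
        return;

    (*work)(p);               /* the real per-stage work, done
                                 while holding no lock at all      */

    mtx_enter(&to->mtx);
    pktq_put(&to->q, p);      /* hand it over to the next stage    */
    mtx_leave(&to->mtx);
}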

While all this sounds neat and simple, after the first few steps it
becomes increasingly difficult to make large gains, for several
reasons. One is that some subsystems are all over the map, like NFS:
it is both IO and network traffic, and if you do wacky things like
"swap on a file that is stored on a remote NFS server", things get
very interesting. Another is that the kernel started out
single-core-only, so you could always read out global variables and
things like "how many packets are there in the queue right now"
without thinking about whether anyone else might have modified them
just as you read them. Those old assumptions no longer hold, and if
you get this wrong, weird things happen in the kernel and you get
weird crashes as a result.
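
The "old assumption" trap can be as small as one line of C. On a
single-core kernel the bare read below was always fine; on MP it races
with whoever is updating the counter on another core, so the read has
to pair up with the writers' lock (or become an atomic operation).
Names are made up:

/* Sketch of the old single-core habit vs. the MP-safe version. */
int          npackets;      /* global packet counter              */
struct mutex pktq_mtx;      /* the lock the queue writers take    */

int
queue_is_busy(void)
{
    int n;

    /* Old habit: n = npackets;  -- fine on one core, a data race
     * once other cores may be modifying it at the same moment.   */

    mtx_enter(&pktq_mtx);   /* MP-safe: read under the same lock
                               the writers use                     */
    n = npackets;
    mtx_leave(&pktq_mtx);

    return (n > 0);
}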

The term "Giant Lock" is also a bit of a misnomer, since at least on
day #1 the getpid() syscall was unlocked, as was certain parts of the
scheduler (I think, not important) so while there was A lock in place,
if your entire dream of computing was to have 20 processes run
getpid() over and over, they could well do that in parallel as much as
they liked as if the OS was totally unlocked. Not very useful, but
still a small, small, percentage of the usable syscalls were totally
possible to run unlocked. A side note on fast syscalling would also be
how linux made gettimeofday() into not-a-syscall by mapping a readonly
page into each process so when they call this via glibc, they read the
timer value out of a page in ram without having to make a syscall at
all, which of course is even faster than having an unlocked call for
it.
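
That Linux trick is what is nowadays called the vDSO. The rough idea
(this is a concept sketch, not the real Linux data layout, and it
omits the memory barriers a real implementation needs): the kernel
keeps updating a timestamp in a page mapped read-only into every
process, plus a sequence counter so a reader can tell if it caught the
kernel mid-update and should retry.

/* Concept sketch only; not the actual Linux vDSO layout. */
struct timepage {
    volatile unsigned int seq;   /* odd while the kernel updates it */
    volatile long         sec;
    volatile long         usec;
};

extern struct timepage *tp;      /* mapped read-only by the kernel  */

void
fast_gettimeofday(long *sec, long *usec)
{
    unsigned int s;

    do {
        s = tp->seq;             /* snapshot the sequence number    */
        *sec = tp->sec;
        *usec = tp->usec;
    } while ((s & 1) || s != tp->seq);  /* retry if we raced with an
                                           update (barriers omitted
                                           for brevity)             */
}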

I made a script that would loop over the relevant OpenBSD kernel
syscall C source and graph the dates at which the number of unlocked
calls changed, and tried to make a decent gnuplot of it, but it comes
out a bit weird. Sometimes there is this "one step back, two steps
forward", sometimes things got reverted for various reasons, and so
on, but if you want to see the graph I have it here:
http://c66.it.su.se:8080/obsd/mplocks.png
It ends in 2020, when I got bored of making it; someone could perhaps
make a new one that covers up until today.

Now, syscalls are just one metric of how unlocked the kernel is. IRQ
handling is another, making network drivers spread packet handling
over different cores is yet another, and the memory subsystem, IO
drivers and filesystems all have various parts that may or may not be
easy to make fine-grained. For memory allocations, per-cpu queues
could be a thing, for which I think there are some patches flying
around; for IRQ spreading it becomes machine/arch dependent, I guess,
on top of all the other problems. I'm sure there are tons of details I
am missing, like clocks, scheduling of processes and threads,
collecting metrics in the kernel and so on.
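
The per-cpu queues idea for memory allocation is worth a sketch of its
own (invented names again, this is not the real pool(9)/malloc(9)
code): each CPU keeps a small private free list it can allocate from
without taking any lock, and only falls back to the shared, locked
pool when that list runs dry.

/* Sketch only: per-CPU free lists in front of a shared pool.
 * Assumes the caller stays on "cpu" (e.g. interrupts blocked)
 * while it touches the local list. */
struct item { struct item *next; };

struct percpu_cache {
    struct item *free_head;       /* only touched by this CPU      */
    int          nfree;
};

struct percpu_cache cache[MAXCPUS];
struct mutex        pool_mtx;     /* protects the shared pool      */

void *
fast_alloc(int cpu)
{
    struct percpu_cache *c = &cache[cpu];
    struct item *it;

    if (c->free_head != NULL) {   /* common case: no lock at all   */
        it = c->free_head;
        c->free_head = it->next;
        c->nfree--;
        return it;
    }

    mtx_enter(&pool_mtx);         /* slow path: refill from the
                                     shared, contended pool        */
    it = pool_get_locked();
    mtx_leave(&pool_mtx);
    return it;
}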

Also, packet handling is much more complex than my 5-step example
above, with PF, bpf and other things wanting to have a look at packets
in between. This is perhaps why some people say "pf is slow": they
think pf is the only step, when it is one link in a chain of 5-10-15
other parts like ether_input() feeding into ip_input() and so forth.
Making pf MP-aware would help some, but not a huge lot UNLESS all the
other parts also work in parallel; otherwise you just queue stuff up
at the first still-locked call after pf is done. Also, I think pf
normally consumes a very small percentage of the time spent on all the
packet handling a firewall does, so reducing an already small part
might be worth very little compared to other, more expensive parts of
the total packet processing.
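
That last point is basically Amdahl's law. A back-of-the-envelope with
made-up numbers: say pf is 10% of the per-packet cost and the
remaining 90% stays serialized. Then even an infinitely parallel pf
only shrinks the total from 1.0 to 0.9, a best-case speedup of 1/0.9,
roughly 1.1x. The big wins have to come out of the expensive 90%.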

Being able to send 20 IO requests at once is nice if you have a
parallelizable filesystem and the drive is nvme/sas/scsi-connected,
but if it is a CF disk on a simple ATA/IDE interface, the device will
only let one outstanding IO be in flight at any time, so you more or
less lose all the gains from a parallel fs. In reverse, if you have a
very simple fs (msdos perhaps?) that by design can't handle any kind
of parallelism, it does not matter much to have it as the single
partition on an nvme drive that can eat 256-1024 IO requests in
parallel, if the filesystem code never issues them side by side.
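
To put made-up numbers on the queue-depth point: at 100 microseconds
per request and only one request in flight, a device tops out around
10,000 IOs per second no matter how many cores want to issue IO. With
256 requests in flight at the same per-request latency, the ceiling is
256 times higher, roughly 2.5 million, but only if the filesystem and
the rest of the stack can actually keep that many requests
outstanding.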

But it is a complex problem, and while OpenBSD might be late to the
game, it also didn't move too fast and break things, and that is worth
something. I would not have liked some period of years when obsd was
unusable because non-careful code was pushed into the kernel just
because a benchmark said "new code faster".

If people actually are interested, by all means do like Alexander
Bluhm, Hrvoje Popovski et al. and rig a small lab and carefully test
each crazy diff that comes out, with before and after numbers for
comparison. Finding out that a diff gives the developer's laptop a 10%
improvement is one thing; finding out that it gives your lab setup a
20% reduction in performance because of different drivers, hardware
and protocols is really good to know before the diff goes in.

-- 
May the most significant bit of your life be positive.
