> > > What is this kernel lock everybody talks about. I mean what is locked?
> > > Some actions must be done and devs call lock before and after it is
> > > done, they call unlock?
> > > What is kernel lock doing exactly, it prevents other procedures to run?
> I was on wikipedia, i did my gogling. There is too much spread and not
> very much related to OpenBSD.
> But it is a large topic i think. And i am not up to understand all.

One example would be when a network packet comes in. At some point it
will end up in an inbound queue from the network driver. If you have 20
cores in the machine, 0, 1 or 20 cores could potentially decide
immediately to pick that packet out of the queue and act on it (receive
it, filter it, route it, whatever) and decrease the number of packets
in the queue by 1 as they work on it. Now if 20 do this in parallel
without any checks, you could have the counter decreased by 20 and have
20 copies of the packet being handled. So you need some kind of
semaphore for the first core that reacts to tell the others to keep off
the input queue for a moment.

When starting out with SMP development, you make this as easy as
possible by having one single lock, the Giant Lock. This makes the
above problem go away in some sense, but it also means your kernel
cannot process any more packets than a single-core CPU without the
locking could.

If this was a router, a very simplified chain of operations would be
"network driver receives packet", "packet input processing", "routing
decision is made", "output packet processing" and lastly "network
driver sends packet out". Five easy steps. In the Giant Lock scenario,
it starts with the network driver causing an interrupt; the kernel acts
on it, sets the Giant Lock to closed, more or less runs all five steps,
and exits back to running userspace things again, releasing the lock so
some other core can grab it as needed later. If many cores often wait
for this lock, they will not be doing much useful work, and that can
lead to a worse situation than if you had only one core.

As the MP improvements are worked on, instead of locking the whole
kernel, certain subsystems (like network, disk, memory, audio) will get
smaller locks of their own, "fine-grained" locks. This way, the web
traffic your www server handles can cause disk IO in parallel with
packet reception if they end up on separate cores. For some systems, an
improvement; for routers, not so much.

As you move these subsystem locks further and further down, more and
more can and will start happening in parallel. Perhaps handling
ethernet IRQs for packet input/output can be separate from ip_input()
and the routing decision engine? If you manage to separate all five
steps of my simplified network routing described above, you can now get
5 cores working on shuffling packets through the machine, each doing
one of the parts. Lots better, but it might not optimally utilize all
20 cores you have. Still, an improvement, as long as the locking dance
is not too expensive compared to the work you do while holding the
lock.

While all this sounds neat and simple, after the first few steps it
becomes increasingly difficult to make large gains, for several
reasons. One is that some subsystems are all over the map, like NFS. It
is both IO and network traffic, and if you do wacky things like "swap
on a file that is stored on a remote NFS server", things get very
interesting. Of course the kernel started out single-core-only, so you
could always read global variables and things like "how many packets
are there in the queue right now" without thinking about whether anyone
else might have modified the value just as you read it.
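To make the inbound-queue example a bit more concrete, here is a tiny
userland toy in C. It is not OpenBSD kernel code, all the names in it
are made up, and a pthread mutex stands in for the kernel's locking,
but it shows the same race: the counter update is a read-modify-write,
and without the lock/unlock pair around it, updates get lost.

/*
 * Userland analogy of the inbound-queue race.  Several "cores"
 * (threads) pull packets off a shared queue; the mutex is what tells
 * the other cores to keep off the queue for a moment.
 */
#include <pthread.h>
#include <stdio.h>

#define NCORES          4
#define NPACKETS        100000

static int npackets = NCORES * NPACKETS;        /* packets waiting */
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;

static void *
core(void *arg)
{
        int i;

        for (i = 0; i < NPACKETS; i++) {
                pthread_mutex_lock(&q_mtx);     /* keep others off the queue */
                npackets--;                     /* take one packet out */
                pthread_mutex_unlock(&q_mtx);
                /* ... process the packet outside the lock ... */
        }
        return NULL;
}

int
main(void)
{
        pthread_t t[NCORES];
        int i;

        for (i = 0; i < NCORES; i++)
                pthread_create(&t[i], NULL, core, NULL);
        for (i = 0; i < NCORES; i++)
                pthread_join(t[i], NULL);

        printf("packets left in queue: %d\n", npackets);
        return 0;
}

With the mutex in place this always prints 0; remove the lock/unlock
calls and it usually will not, which is exactly the "counter decreased
by 20, 20 copies of the packet" kind of mess described above. The Giant
Lock is basically one such lock wrapped around almost everything the
kernel does; fine-grained locking means many small ones like q_mtx,
each protecting one queue or subsystem.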
All this also means that old assumptions no longer hold, and if you do
this wrong, weird things happen in the kernel and you get weird crashes
as a result.

The term "Giant Lock" is also a bit of a misnomer, since at least on
day #1 the getpid() syscall was unlocked, as were certain parts of the
scheduler (I think, not important), so while there was A lock in place,
if your entire dream of computing was to have 20 processes run getpid()
over and over, they could well do that in parallel as much as they
liked, as if the OS was totally unlocked. Not very useful, but still: a
small, small percentage of the usable syscalls could run completely
unlocked.

A side note on fast syscalling would be how Linux made gettimeofday()
into not-a-syscall by mapping a read-only page into each process, so
when they call this via glibc, they read the timer value out of a page
in RAM without having to make a syscall at all, which of course is even
faster than having an unlocked call for it.

I made a script that would loop over the relevant OpenBSD kernel
syscall C source and graph at which date the number of unlocked calls
changed, and tried to make a decent gnuplot of it, but it becomes
weird. Sometimes there is this "one step back, two steps forward",
sometimes things got reverted for reasons, and so on, but if you want
to see the graph I have it here:

http://c66.it.su.se:8080/obsd/mplocks.png

It ends in 2020 when I got bored of making it; someone could perhaps
make a new one that covers up until today.

Now, syscalls are just one metric of how unlocked the kernel is. IRQ
handling is another, making network drivers spread packet handling over
different cores is yet another, and the memory subsystem, IO drivers
and filesystems all have various parts that may or may not be easy to
make fine-grained. For memory allocations, having per-CPU queues could
be a thing, for which I think there are some patches flying around; for
IRQ spreading it becomes machine/arch dependent, I guess, on top of all
the other problems. I'm sure there are tons of details I am missing,
like clocks, scheduling of processes and threads, collecting metrics in
the kernel and so on.

Also, packet handling is much more complex than my 5-step example
above, with PF, bpf and other things wanting to have a look at packets
in between. This is perhaps why some people say "pf is slow": they
treat pf as if it were the only step, when it is really one part in a
chain of 5-10-15 other parts, like ether_input() feeding into
ip_input() and so forth. Making pf MP-aware would help some, but not a
huge lot UNLESS all the other parts also work in parallel; otherwise
you just queue stuff up at the first non-unlocked call after pf is
done. Also, I think pf normally consumes a very small percentage of the
time spent on all the packet handling a firewall does, so reducing an
already small part might be worth very little compared to other, more
expensive parts of the total packet processing.

Being able to send 20 IO requests at once is nice if you have a
parallelizable filesystem and the drive is NVMe/SAS/SCSI-connected, but
if it is a CF disk on a simple ATA/IDE interface, the device will only
let one outstanding IO be in flight at any time, so you more or less
lose all gains from a parallel fs. In reverse, if you have a very
simple fs (msdos perhaps?) that by design can't handle any kind of
parallelism, it would not matter much to have it as the single
partition on an NVMe drive that can eat 256-1024 IO requests in
parallel, if the filesystem code never issues them side by side.
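Going back to the getpid() example: the idea of "most syscalls take the
big lock, a few are flagged to run without it" can be pictured roughly
like the toy dispatcher below. This is a userland sketch, not the real
OpenBSD dispatch code; the struct, flag and function names are all made
up for illustration, and a pthread mutex again stands in for the kernel
lock (the real kernel macros are KERNEL_LOCK()/KERNEL_UNLOCK()).

/*
 * Toy model of "big lock around most syscalls, a few run unlocked".
 */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t giant = PTHREAD_MUTEX_INITIALIZER;

#define SYF_NOLOCK      0x01    /* handler is safe without the big lock */

struct toy_sysent {
        const char       *name;
        int             (*call)(void);
        int               flags;
};

static int
toy_getpid(void)
{
        return (int)getpid();   /* reads nothing shared, safe unlocked */
}

static int
toy_read(void)
{
        return 0;               /* pretend this touches shared kernel state */
}

static struct toy_sysent table[] = {
        { "getpid",     toy_getpid,     SYF_NOLOCK },
        { "read",       toy_read,       0 },
};

static int
dispatch(struct toy_sysent *sy)
{
        int r;

        if (sy->flags & SYF_NOLOCK)
                return sy->call();      /* any core, any time, in parallel */

        pthread_mutex_lock(&giant);     /* everyone else lines up here */
        r = sy->call();
        pthread_mutex_unlock(&giant);
        return r;
}

int
main(void)
{
        printf("%s -> %d\n", table[0].name, dispatch(&table[0]));
        printf("%s -> %d\n", table[1].name, dispatch(&table[1]));
        return 0;
}

All the locked calls serialize behind that one lock no matter which
core they run on, while the flagged ones go straight through, which is
why 20 processes hammering getpid() could run in parallel even on day
#1. The MP work is essentially about moving more and more handlers (and
interrupt paths, drivers and so on) from the locked branch to the
unlocked one, or giving them smaller locks of their own.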
But it is a complex problem, and while OpenBSD might be late to the
game, it also didn't move too fast in this regard and break things, and
that is worth something. I would not have liked a period of years when
obsd was unusable because careless code was pushed into the kernel just
because a benchmark said "new code faster".

If people actually are interested, by all means do like Alexander
Bluhm, Hrvoje Popovski et al.: rig a small lab and carefully test each
crazy diff that comes out, with before-and-after numbers for
comparison. Finding out that a diff gives the developer's laptop a 10%
improvement is one thing; finding out that it gives your lab setup a
20% reduction in performance because of different drivers, hardware and
protocols is really good to know before the diff goes in.

--
May the most significant bit of your life be positive.