On Wed, Aug 7, 2024 at 10:04 AM, Janne Johansson (icepic...@gmail.com) wrote:
>
> > > > What is this kernel lock everybody talks about. I mean, what is locked?
> > > > Some actions must be done, and devs call lock before and after it is
> > > > done, they call unlock?
> > > > What is the kernel lock doing exactly, does it prevent other procedures from running?
>
> > I was on Wikipedia, I did my googling. There is too much spread out there and not
> > very much related to OpenBSD.
> > But it is a large topic, I think, and I am not up to understanding all of it.
>
> One example would be when a network packet comes in. At some point it
> will end up in an inbound queue from the network driver. If you have
> 20 cores in the machine, 0, 1, or 20 cores could potentially decide
> immediately to pick that packet out of the queue and act on it
> (receive it, filter it, route it, whatever) and decrease the number of
> packets in the queue by 1 as they work on it. Now if 20 do this in
> parallel without any checks, you could have the counter decreased by
> 20, and have 20 copies of the packet being handled. So you need some
> kind of semaphore for the first core to react to tell the others to
> keep off the input queue for a moment.
>
> When starting out with SMP development, you make this the easiest way
> possible, by having one single lock, the Giant Lock. This will make
> the above problem go away in some sense, but it will also make your
> kernel unable to process any more packets than if it were a
> single-core CPU without the locking. If this were a router, a very
> simplified chain of operations would be "network driver receives
> packet", "packet input processing", "routing decision is made",
> "output packet processing" and lastly "network driver sends packet
> out". Five easy steps.
>
> In the Giant Lock scenario, it starts with the network driver
> causing an interrupt; the kernel acts on it, sets the Giant Lock to
> closed, more or less runs all five steps, and exits back to running
> userspace things again, releasing the lock so some other core can grab
> it as needed later. If many cores often wait for this lock, they will
> not be doing lots of useful work, and it leads to a worse situation
> than if you had only one core.
>
> As the MP improvements are worked on, instead of locking the whole
> kernel, certain subsystems (like network, disk, memory, audio) will
> get smaller locks of their own, "fine-grained" locks. This way, the
> web traffic your www server handles can cause disk IO in parallel with
> packet reception if they end up on separate cores. For some systems,
> an improvement; for routers, not so much.
>
> As you move these subsystem locks further and further down, more and
> more can and will start happening in parallel. Perhaps handling
> ethernet IRQs for packet input/output can be separate from
> ip_input() and the routing decision engine? If you manage to separate
> all five steps of my simplified network routing described above, you
> can now get 5 cores working on shuffling packets through the machine,
> each doing one of the parts. Lots better, but it might not
> optimally utilize all 20 cores you have. Still, an improvement, as
> long as the locking dance is not too expensive compared to the work
> you do while holding it.
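To make the input-queue example above concrete, here is a minimal userspace
sketch of the same idea: pthreads stand in for the 20 cores, and a single
pthread_mutex_t plays the role of the lock that tells the other cores to keep
off the queue for a moment. This is my own illustration, not OpenBSD kernel
code (in the kernel the rough equivalents are KERNEL_LOCK()/KERNEL_UNLOCK()
for the big lock and per-subsystem mutexes for the fine-grained ones); remove
the lock/unlock pair around the counter and it goes wrong exactly as
described.

/*
 * Userspace sketch of the input-queue race described above.
 * Threads stand in for cores; a pthread mutex stands in for the
 * queue lock. Not actual kernel code.
 */
#include <pthread.h>
#include <stdio.h>

#define NCORES   20
#define NPACKETS 100000

static int pktcount = NPACKETS;         /* packets waiting in the queue */
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;

static void *
worker(void *arg)
{
    long handled = 0;

    for (;;) {
        pthread_mutex_lock(&q_mtx);     /* "keep off the queue for a moment" */
        if (pktcount == 0) {
            pthread_mutex_unlock(&q_mtx);
            break;
        }
        pktcount--;                     /* take one packet, exactly once */
        pthread_mutex_unlock(&q_mtx);

        handled++;                      /* the real "work" happens outside the lock */
    }
    return (void *)handled;
}

int
main(void)
{
    pthread_t t[NCORES];
    long total = 0;

    for (int i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NCORES; i++) {
        void *r;
        pthread_join(t[i], &r);
        total += (long)r;
    }
    /* Without the mutex, total would not reliably equal NPACKETS. */
    printf("handled %ld of %d packets\n", total, NPACKETS);
    return 0;
}

Build with something like "cc -pthread queue.c" (file name is just an
example). Note that the mutex here only covers this one queue, which is the
fine-grained version; the Giant Lock approach is the same dance with one
mutex wrapped around all five steps, which is why everything behind it
serializes.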
> While all this sounds neat and simple, after the first few steps it
> becomes increasingly difficult to make large gains, for several
> reasons. One is that some subsystems are all over the map, like NFS.
> It is both IO and network traffic, and if you do wacky things like
> "swap on a file that is stored on a remote NFS server", things get
> very interesting. Of course, the kernel started out single-core-only, so
> you could always read out global variables and things like "how many
> packets are there in the queue right now" without thinking about whether
> anyone else might have modified them just as you read them. This also means
> that old assumptions no longer hold, and if you do this wrong, weird
> things happen in the kernel and you get weird crashes as a result.
>
> The term "Giant Lock" is also a bit of a misnomer, since at least on
> day #1 the getpid() syscall was unlocked, as were certain parts of the
> scheduler (I think, not important), so while there was A lock in place,
> if your entire dream of computing was to have 20 processes run
> getpid() over and over, they could well do that in parallel as much as
> they liked, as if the OS was totally unlocked. Not very useful, but
> still a small, small percentage of the usable syscalls were totally
> possible to run unlocked. A side note on fast syscalling would also be
> how Linux made gettimeofday() into not-a-syscall by mapping a read-only
> page into each process, so when programs call this via glibc, they read the
> timer value out of a page in RAM without having to make a syscall at
> all, which of course is even faster than having an unlocked call for
> it.
>
> I made a script that would loop over the relevant OpenBSD kernel
> syscall C source and graph at which date the number of unlocked calls
> changed, and tried to make a decent gnuplot of it, but it becomes
> weird. Sometimes there is this "one step back, two steps forward",
> sometimes things got reverted for reasons, and so on, but if you want
> to see the graph I have it here:
> http://c66.it.su.se:8080/obsd/mplocks.png
> It ends in 2020 when I got bored of making it; someone could perhaps
> make a new one that covers up until today.
>
> Now, syscalls are just one metric of how unlocked the kernel is; IRQ
> handling is another, making network drivers spread packet handling
> over different cores is yet another, and the memory subsystem, IO drivers
> and filesystems all have various parts that may or may not be easy to
> make fine-grained. For memory allocations, having per-CPU queues could
> be a thing, for which I think there are some patches flying around; for
> IRQ spreading it becomes machine/arch dependent, I guess, on top of all
> the other problems. I'm sure there are tons of details I am missing, like
> clocks, scheduling of processes and threads, collecting metrics in the
> kernel and so on.
>
> Also, packet handling is much more complex than my 5-step example
> above, with PF, bpf and other things wanting to have a look at packets
> in between. This is perhaps why some people say "pf is slow", because
> they think pf is the only step in a chain of 5-10-15 other parts like
> ether_input() feeding into ip_input() and so forth. Making pf MP-aware
> would help some, but not a huge lot UNLESS all the other parts also work
> in parallel; otherwise you just queue stuff up at the first
> non-unlocked call after pf is done. Also, I think pf normally consumes
> a very small percentage of the time for all the packet handling a firewall
> does, so reducing an already small part might be worth very little
> compared to other, more expensive parts of the total packet processing
> that is done.
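On the gettimeofday() side note above: a quick way to see that effect on a
Linux box is to time the libc call (which normally goes through the vDSO page
mapped into the process) against a forced kernel entry via syscall(2). A
rough sketch, assuming Linux with SYS_gettimeofday available on your
architecture, and with no attempt at careful benchmarking:

/*
 * Compare gettimeofday() through libc (usually the vDSO, no kernel
 * entry) with a forced raw syscall. Linux-only; rough numbers.
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000000

static double
elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int
main(void)
{
    struct timespec t0, t1;
    struct timeval tv;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        gettimeofday(&tv, NULL);              /* normally served from the vDSO page */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("libc gettimeofday:    %.1f ns/call\n", elapsed_ns(t0, t1) / ITERS);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        syscall(SYS_gettimeofday, &tv, NULL); /* forced real syscall */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("raw SYS_gettimeofday: %.1f ns/call\n", elapsed_ns(t0, t1) / ITERS);

    return 0;
}

On a typical amd64 Linux machine the vDSO path comes out several times
cheaper per call. If I remember correctly, OpenBSD has since grown its own
userland timecounter mechanism along similar lines, but that is beside the
point being made here.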
> Being able to send 20 IO requests at once is nice if you have a
> parallelizable file system and the drive is nvme/sas/scsi-connected,
> but if it is a CF disk on a simple ATA/IDE interface, the device will
> only let one outstanding IO be in flight at any time, so you more or
> less lose all the gains from a parallel fs. In reverse, if you have a
> very simple fs (msdos perhaps?) that by design can't handle any
> kind of parallelism, it would not matter much to have it as the
> single partition on an nvme drive that can eat 256-1024 IO requests in
> parallel, if the filesystem code never issues them side by side.
>
> But it is a complex problem, and while OpenBSD might be late to the
> game, it also didn't move too fast in this regard and break things,
> and that is worth something. I would not have liked some period of
> years when obsd was unusable because non-careful code was pushed into
> the kernel only because a benchmark says "new code faster".
>
> If people actually are interested, by all means do like Alexander
> Bluhm, Hrvoje Popovski et al. and rig a small lab and carefully test
> each crazy diff that comes out, with before and after numbers for
> comparison. Finding out that a diff gives the developer's laptop a 10%
> improvement is one thing; finding out it gives your lab setup a 20%
> reduction in perf because of different drivers, hw and protocols is really
> good to know before the diff goes in.
>
> --
> May the most significant bit of your life be positive.
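As for the graph of unlocked syscalls mentioned above: a much cruder,
point-in-time version of the same count can be had by looking at how many
entries in the syscall table carry the NOLOCK flag. A small sketch, assuming
a source tree under /usr/src and assuming NOLOCK in sys/kern/syscalls.master
is still how syscalls that run without the kernel lock are marked (the count
is rough, since obsolete and unimplemented slots are included):

/*
 * Count syscall entries flagged NOLOCK in syscalls.master.
 * Assumes the OpenBSD source tree is checked out under /usr/src.
 */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char *argv[])
{
    const char *path = argc > 1 ? argv[1] :
        "/usr/src/sys/kern/syscalls.master";
    char line[1024];
    int total = 0, nolock = 0;
    FILE *f;

    if ((f = fopen(path, "r")) == NULL) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f) != NULL) {
        if (!isdigit((unsigned char)line[0]))
            continue;                   /* skip comments, cpp lines, continuations */
        total++;                        /* entry lines start with the syscall number */
        if (strstr(line, "NOLOCK") != NULL)
            nolock++;
    }
    fclose(f);
    printf("%d of %d syscall entries flagged NOLOCK\n", nolock, total);
    return 0;
}

The plot linked above effectively ran this kind of count over checkouts at
many different dates; looping something like this over cvs/git checkouts
would reproduce a graph that covers up until today.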
Technicalities aside (relevant if we want to test and improve things
correctly), the speed increase is noticeable. On a Ryzen 7 3800X with 32 GB
RAM and NVMe, the responsiveness of Chrome and Firefox has certainly
improved a lot. LibreOffice and GIMP responsiveness I don't see improving,
though, but I imagine the new improvements in #233+ relate to UDP and the
recent mpsafe code that was added to the kernel. Thanks for your excellent
work, everyone!

--
"God's in His Heaven, all's well on Earth"
***********************************************