On Wed, Aug 7, 2024 at 10:04 AM, Janne Johansson (icepic...@gmail.com) wrote:
>
> > > > What is this kernel lock everybody talks about. I mean, what is locked?
> > > > Some actions must be done, and devs call lock before and after it is
> > > > done, they call unlock?
> > > > What is the kernel lock doing exactly, does it prevent other procedures from running?
>
> > I was on Wikipedia, I did my googling. There is too much spread out there and not
> > very much related to OpenBSD.
> > But it is a large topic, I think, and I am not up to understanding all of it.
>
> One example would be when a network packet comes in. At some point it
> will end up in an inbound queue from the network driver. If you have
> 20 cores in the machine, 0, 1, or 20 cores could potentially decide
> immediately to pick that packet out of the queue and act on it
> (receive it, filter it, route it, whatever) and decrease the number of
> packets in the queue by 1 as they work on it. Now if 20 do this in
> parallel without any checks, you could have the counter decreased by
> 20, and have 20 copies of the packet being handled. So you need some
> kind of semaphore for the first core to react to tell the others to
> keep off the input queue for a moment.
>
> When starting out with SMP development, you make this the easiest way
> possible, by having one single lock, the Giant Lock. This will make
> the above problem go away in some sense, but it will also make your
> kernel unable to process any more packets than if it were a
> single-core CPU without the locking. If this were a router, a very
> simplified chain of operations would be "network driver receives
> packet", "packet input processing", "routing decision is made",
> "output packet processing" and lastly "network driver sends packet
> out". Five easy steps.
>
> In the Giant Lock scenario, it starts with the network driver
> causing an interrupt; the kernel acts on it, sets the Giant Lock to
> closed, more or less runs all five steps, and exits back to running
> userspace things again, releasing the lock so some other core can grab
> it as needed later. If many cores often wait for this lock, they will
> not be doing lots of useful work, and it leads to a worse situation
> than if you had only one core.
>
> As the MP improvements are worked on, instead of locking the whole
> kernel, certain subsystems (like network, disk, memory, audio) will
> get smaller locks of their own, "fine-grained" locks. This way, the
> web traffic your www server handles can cause disk IO in parallel with
> packet reception if they end up on separate cores. For some systems,
> an improvement; for routers, not so much.
>
> As you move these subsystem locks further and further down, more and
> more can and will start happening in parallel. Perhaps handling
> ethernet IRQs for packet input/output can be separate from
> ip_input() and the routing decision engine? If you manage to separate
> all five steps of my simplified network routing described above, you
> can now get 5 cores working on shuffling packets through the machine,
> each doing one of the parts. Lots better, but it might not
> optimally utilize all 20 cores you have. Still, an improvement, as
> long as the locking dance is not too expensive compared to the work
> you do while holding it.
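To make the input-queue example above concrete, here is a minimal userspace
sketch of the same idea: pthreads stand in for the 20 cores, and a single
pthread_mutex_t plays the role of the lock that tells the other cores to keep
off the queue for a moment. This is my own illustration, not OpenBSD kernel
code (in the kernel the rough equivalents are KERNEL_LOCK()/KERNEL_UNLOCK()
for the big lock and per-subsystem mutexes for the fine-grained ones); remove
the lock/unlock pair around the counter and it goes wrong exactly as
described.

/*
 * Userspace sketch of the input-queue race described above.
 * Threads stand in for cores; a pthread mutex stands in for the
 * queue lock. Not actual kernel code.
 */
#include <pthread.h>
#include <stdio.h>

#define NCORES   20
#define NPACKETS 100000

static int pktcount = NPACKETS;         /* packets waiting in the queue */
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;

static void *
worker(void *arg)
{
    long handled = 0;

    for (;;) {
        pthread_mutex_lock(&q_mtx);     /* "keep off the queue for a moment" */
        if (pktcount == 0) {
            pthread_mutex_unlock(&q_mtx);
            break;
        }
        pktcount--;                     /* take one packet, exactly once */
        pthread_mutex_unlock(&q_mtx);

        handled++;                      /* the real "work" happens outside the lock */
    }
    return (void *)handled;
}

int
main(void)
{
    pthread_t t[NCORES];
    long total = 0;

    for (int i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NCORES; i++) {
        void *r;
        pthread_join(t[i], &r);
        total += (long)r;
    }
    /* Without the mutex, total would not reliably equal NPACKETS. */
    printf("handled %ld of %d packets\n", total, NPACKETS);
    return 0;
}

Build with something like "cc -pthread queue.c" (file name is just an
example). Note that the mutex here only covers this one queue, which is the
fine-grained version; the Giant Lock approach is the same dance with one
mutex wrapped around all five steps, which is why everything behind it
serializes.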
> While all this sounds neat and simple, after the first few steps it
> becomes increasingly difficult to make large gains, for several
> reasons. One is that some subsystems are all over the map, like NFS.
> It is both IO and network traffic, and if you do wacky things like
> "swap on a file that is stored on a remote NFS server", things get
> very interesting. Of course, the kernel started out single-core-only, so
> you could always read out global variables and things like "how many
> packets are there in the queue right now" without thinking about whether
> anyone else might have modified them just as you read them. This also means
> that old assumptions no longer hold, and if you do this wrong, weird
> things happen in the kernel and you get weird crashes as a result.
>
> The term "Giant Lock" is also a bit of a misnomer, since at least on
> day #1 the getpid() syscall was unlocked, as were certain parts of the
> scheduler (I think, not important), so while there was A lock in place,
> if your entire dream of computing was to have 20 processes run
> getpid() over and over, they could well do that in parallel as much as
> they liked, as if the OS was totally unlocked. Not very useful, but
> still a small, small percentage of the usable syscalls were totally
> possible to run unlocked. A side note on fast syscalling would also be
> how Linux made gettimeofday() into not-a-syscall by mapping a read-only
> page into each process, so when programs call this via glibc, they read the
> timer value out of a page in RAM without having to make a syscall at
> all, which of course is even faster than having an unlocked call for
> it.
>
> I made a script that would loop over the relevant OpenBSD kernel
> syscall C source and graph at which date the number of unlocked calls
> changed, and tried to make a decent gnuplot of it, but it becomes
> weird. Sometimes there is this "one step back, two steps forward",
> sometimes things got reverted for reasons, and so on, but if you want
> to see the graph I have it here:
> http://c66.it.su.se:8080/obsd/mplocks.png
> It ends in 2020 when I got bored of making it; someone could perhaps
> make a new one that covers up until today.
>
> Now, syscalls are just one metric of how unlocked the kernel is; IRQ
> handling is another, making network drivers spread packet handling
> over different cores is yet another, and the memory subsystem, IO drivers
> and filesystems all have various parts that may or may not be easy to
> make fine-grained. For memory allocations, having per-CPU queues could
> be a thing, for which I think there are some patches flying around; for
> IRQ spreading it becomes machine/arch dependent, I guess, on top of all
> the other problems. I'm sure there are tons of details I am missing, like
> clocks, scheduling of processes and threads, collecting metrics in the
> kernel and so on.
>
> Also, packet handling is much more complex than my 5-step example
> above, with PF, bpf and other things wanting to have a look at packets
> in between. This is perhaps why some people say "pf is slow", because
> they think pf is the only step in a chain of 5-10-15 other parts like
> ether_input() feeding into ip_input() and so forth. Making pf MP-aware
> would help some, but not a huge lot UNLESS all the other parts also work
> in parallel; otherwise you just queue stuff up at the first
> non-unlocked call after pf is done. Also, I think pf normally consumes
> a very small percentage of the time for all the packet handling a firewall
> does, so reducing an already small part might be worth very little
> compared to other, more expensive parts of the total packet processing
> that is done.
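On the gettimeofday() side note above: a quick way to see that effect on a
Linux box is to time the libc call (which normally goes through the vDSO page
mapped into the process) against a forced kernel entry via syscall(2). A
rough sketch, assuming Linux with SYS_gettimeofday available on your
architecture, and with no attempt at careful benchmarking:

/*
 * Compare gettimeofday() through libc (usually the vDSO, no kernel
 * entry) with a forced raw syscall. Linux-only; rough numbers.
 */
#include <stdio.h>
#include <sys/syscall.h>
#include <sys/time.h>
#include <time.h>
#include <unistd.h>

#define ITERS 1000000

static double
elapsed_ns(struct timespec a, struct timespec b)
{
    return (b.tv_sec - a.tv_sec) * 1e9 + (b.tv_nsec - a.tv_nsec);
}

int
main(void)
{
    struct timespec t0, t1;
    struct timeval tv;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        gettimeofday(&tv, NULL);              /* normally served from the vDSO page */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("libc gettimeofday:    %.1f ns/call\n", elapsed_ns(t0, t1) / ITERS);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < ITERS; i++)
        syscall(SYS_gettimeofday, &tv, NULL); /* forced real syscall */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("raw SYS_gettimeofday: %.1f ns/call\n", elapsed_ns(t0, t1) / ITERS);

    return 0;
}

On a typical amd64 Linux machine the vDSO path comes out several times
cheaper per call. If I remember correctly, OpenBSD has since grown its own
userland timecounter mechanism along similar lines, but that is beside the
point being made here.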
> Being able to send 20 IO requests at once is nice if you have a
> parallelizable file system and the drive is nvme/sas/scsi-connected,
> but if it is a CF disk on a simple ATA/IDE interface, the device will
> only let one outstanding IO be in flight at any time, so you more or
> less lose all the gains from a parallel fs. In reverse, if you have a
> very simple fs (msdos perhaps?) that by design can't handle any
> kind of parallelism, it would not matter much to have it as the
> single partition on an nvme drive that can eat 256-1024 IO requests in
> parallel, if the filesystem code never issues them side by side.
>
> But it is a complex problem, and while OpenBSD might be late to the
> game, it also didn't move too fast in this regard and break things,
> and that is worth something. I would not have liked some period of
> years when obsd was unusable because non-careful code was pushed into
> the kernel only because a benchmark says "new code faster".
>
> If people actually are interested, by all means do like Alexander
> Bluhm, Hrvoje Popovski et al. and rig a small lab and carefully test
> each crazy diff that comes out, with before and after numbers for
> comparison. Finding out that a diff gives the developer's laptop a 10%
> improvement is one thing; finding out it gives your lab setup a 20%
> reduction in perf because of different drivers, hw and protocols is really
> good to know before the diff goes in.
>
> --
> May the most significant bit of your life be positive.
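As for the graph of unlocked syscalls mentioned above: a much cruder,
point-in-time version of the same count can be had by looking at how many
entries in the syscall table carry the NOLOCK flag. A small sketch, assuming
a source tree under /usr/src and assuming NOLOCK in sys/kern/syscalls.master
is still how syscalls that run without the kernel lock are marked (the count
is rough, since obsolete and unimplemented slots are included):

/*
 * Count syscall entries flagged NOLOCK in syscalls.master.
 * Assumes the OpenBSD source tree is checked out under /usr/src.
 */
#include <ctype.h>
#include <stdio.h>
#include <string.h>

int
main(int argc, char *argv[])
{
    const char *path = argc > 1 ? argv[1] :
        "/usr/src/sys/kern/syscalls.master";
    char line[1024];
    int total = 0, nolock = 0;
    FILE *f;

    if ((f = fopen(path, "r")) == NULL) {
        perror(path);
        return 1;
    }
    while (fgets(line, sizeof(line), f) != NULL) {
        if (!isdigit((unsigned char)line[0]))
            continue;                   /* skip comments, cpp lines, continuations */
        total++;                        /* entry lines start with the syscall number */
        if (strstr(line, "NOLOCK") != NULL)
            nolock++;
    }
    fclose(f);
    printf("%d of %d syscall entries flagged NOLOCK\n", nolock, total);
    return 0;
}

The plot linked above effectively ran this kind of count over checkouts at
many different dates; looping something like this over cvs/git checkouts
would reproduce a graph that covers up until today.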
Technicalities aside (relevant if we want to test and improve things
correctly), the speed increase is noticeable. On a Ryzen 7 3800X with 32 GB
RAM and NVMe, the responsiveness of Chrome and Firefox has certainly
improved a lot. LibreOffice and GIMP responsiveness I don't see improving,
though, but I imagine the new improvements in #233+ relate to UDP and the
recent mpsafe code that was added to the kernel. Thanks for your excellent
work, everyone!

--
"God's in His Heaven, all's well on Earth"
***********************************************