See the inserted comments, please.

On Wed, Aug 7, 2024 at 3:54 PM Janne Johansson <icepic...@gmail.com> wrote:
> > > > > What is this kernel lock everybody talks about. I mean what is locked?
> > > > Some actions must be done and devs call lock before and after it is
> > > > done, they call unlock?
> > > > What is kernel lock doing exactly, it prevents other procedures to run?
> > > I was on wikipedia, i did my gogling. There is too much spread and not
> > very much related to OpenBSD.
> > But it is a large topic i think. And i am not up to understand all.
>
> One example would be when a network packet comes in. At some point it
> will end up in an inbound queue from the network driver. If you have
> 20 cores in the machine, 0,1,or 20 cores could potentially decide
> immediately to pick that packet out of the queue and act on it
> (receive it, filter it, route it, whatever) and decrease the number of
> packets in the queue with 1 as they work on it. Now if 20 do this in
> parallel without any checks, you could have the counter decreased by
> 20, and have 20 copies of the packet being handled. So you need some
> kind of semaphore for the first core to react to tell the others to
> keep off the input queue for a moment.
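If I understand the race correctly, in userland terms it would look
something like the sketch below. This is only an analogy using
pthreads (the kernel has its own mutex primitives and queues, and none
of these names are real OpenBSD code), but it shows the "keep off the
input queue" idea: whoever holds the lock may touch the shared
counter, everybody else waits.

#include <pthread.h>
#include <stdio.h>

#define NCORES   4
#define NPACKETS 1000000

/* Shared state: how many packets are waiting in the input queue. */
static int packets_in_queue = NPACKETS;
static pthread_mutex_t queue_mtx = PTHREAD_MUTEX_INITIALIZER;

static void *
worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&queue_mtx);   /* "keep off the queue" */
        if (packets_in_queue == 0) {
            pthread_mutex_unlock(&queue_mtx);
            return NULL;
        }
        packets_in_queue--;               /* safe: we hold the lock */
        pthread_mutex_unlock(&queue_mtx);
        /* the actual packet work would happen here, outside the lock */
    }
}

int
main(void)
{
    pthread_t t[NCORES];
    int i;

    for (i = 0; i < NCORES; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (i = 0; i < NCORES; i++)
        pthread_join(t[i], NULL);
    printf("packets left: %d\n", packets_in_queue);
    return 0;
}

Without the lock/unlock pair, two threads can both see the counter at
1 and both decrement it, so it goes negative and the result is
garbage; with the lock it always ends at 0.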
Let's say we have a CPU with 16 cores. One packet in the network card
buffer is not a good example; there is no parallel action needed for
one packet. Let's say there are 16 packets waiting in the network card
buffer. The basic idea is that one core will deal with them
sequentially, one by one, which is very time consuming. Do you mean
that it is possible for each core to "extract" one packet at a time
from the network card buffer? I guess the card can hand one packet at
a time to each core, and then the cores stay busy with other
packet-related tasks. So what would a "lock" do here: tell the other
cores "look, the card is busy answering this core's request, hold your
calls or form a queue until I finish serving this packet and remove
(delete) it from the buffer"? And what would a "giant lock" do: block
everything on the hardware, like disk access, for a simple packet
transfer from a network card?

>
> As the MP improvements are worked on, instead of locking the whole
> kernel, certain subsystems (like network, disk, memory, audio) will
> get smaller locks of their own, "fine-grained" locks. This way, the
> web traffic your www server handles can cause disk IO in parallel with
> packet reception if they end up on separate cores. For some systems,
> an improvement, for routers - not so much.

I think there is no "one solution fits all scenarios" here, so each
part must be understood and implemented differently? Am I right? (I
tried to write the giant vs. per-subsystem difference down as a small
sketch further down in this mail.)

> As you move these subsystem locks further and further down, more and
> more can and will start happening in parallel. Perhaps handling
> ethernet IRQs for packet input/output can be separate from the
> ip_input() and the routing decision engine? If you manage to separate
> all five steps of my simplified network routing described above, you
> can now get 5 cores working on shuffling packets through the machine,
> each doing one of the parts each. Lots better, but it might not
> optimally utilize all 20 cores you have. Still, an improvement, as
> long as the locking dance is not too expensive compared to the work
> you do while holding it.

So the word "block" means that code execution on the other cores is
actually blocked, just to be sure you are not doing something wrong?

> While all this sounds neat and simple, after the few first steps it
> becomes increasingly difficult to make large gains for several
> reasons. One is that some subsystems are all over the map, like NFS.
> It is both IO and network traffic and if you do wacky things like
> "swap on a file that is stored on a remote NFS server", things get
> very interesting. Of course the kernel started out single-core-only so
> you could always read out global variables and things like "how many
> packets are there in the queue right now" without thinking about if
> anyone else might have modified it just as I read it. This also means
> that old assumptions no longer hold, and if you do this wrong, weird
> things happen in the kernel and you get weird crashes as a result.

From what I see, the devs must do a rewrite of the kernel; maybe not
100%, but it is a substantial effort.
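Here is that sketch of the difference between one giant lock and
fine-grained per-subsystem locks. Again this is a made-up userland
analogy with pthreads; the subsystem names and functions are invented
for illustration and are not actual kernel code:

#include <pthread.h>
#include <stdio.h>

/* Giant-lock style: every kernel path takes the same lock, so a core
 * handling a packet and a core doing disk IO serialize even though
 * they touch completely different data. */
static pthread_mutex_t giant = PTHREAD_MUTEX_INITIALIZER;

static void
net_rx_giant(void)
{
    pthread_mutex_lock(&giant);
    /* ... receive/filter/route a packet ... */
    pthread_mutex_unlock(&giant);
}

static void
disk_io_giant(void)
{
    pthread_mutex_lock(&giant);
    /* ... start a disk transfer ... */
    pthread_mutex_unlock(&giant);
}

/* Fine-grained style: each subsystem has its own lock, so the web
 * server's disk IO and the packet reception can run on different
 * cores at the same time. */
static pthread_mutex_t net_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t disk_mtx = PTHREAD_MUTEX_INITIALIZER;

static void
net_rx_mp(void)
{
    pthread_mutex_lock(&net_mtx);
    /* ... receive/filter/route a packet ... */
    pthread_mutex_unlock(&net_mtx);
}

static void
disk_io_mp(void)
{
    pthread_mutex_lock(&disk_mtx);
    /* ... start a disk transfer ... */
    pthread_mutex_unlock(&disk_mtx);
}

int
main(void)
{
    /* Just call each path once so the sketch compiles and runs. */
    net_rx_giant();
    disk_io_giant();
    net_rx_mp();
    disk_io_mp();
    printf("done\n");
    return 0;
}

If I got it right, the MP work is about moving from the first pattern
to the second, subsystem by subsystem, without breaking the old
single-core assumptions along the way.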
> Also, packet handling is much more complex than my 5-step example
> above with PF, bpf and other things wanting to have a look at packets
> in between. This is perhaps why some people say "pf is slow", because
> they think pf is the only step in a chain of 5-10-15 other parts like
> ether_input() feeding into ip_input() and so forth. Making pf MP-aware
> would help some, but not a huge lot UNLESS all other parts also work
> in parallel, otherwise you just queue stuff up at the first
> non-unlocked call after pf is done. Also, I think pf normally consume
> a very small percentage of time for all the packet handling a firewall
> does, so reducing an already small part might be worth very little
> compared to other more expensive parts of the total packet processing
> that is done.

Just as an exercise, not an actual implementation idea: one could
create a RAM buffer (something like a staging area for the TCP/IP
stack), the cores would do the job of moving the packets there, and
each application requesting TCP/IP packets would just copy them from
there. Would that be a valid way to handle SMP? Of course this is a
simplified scenario.

> Being able to send 20 IO requests at once is nice if you have a
> parallelizable file system and the drive is nvme/sas/scsi-connected,
> but if it is a CF-disk on a simple ATA/IDE interface, the device will
> only let one outstanding IO be in flight at any time so you more or
> less lose all gains from a parallel fs, and in reverse if you have a
> very simple fs (msdos perhaps?) and it by design can't handle any
> kinds of parallelism, it would not matter much to have it as the
> single partition on nvme that can eat 256-1024 IO requests in parallel
> if the filesystem code never issues them side by side.

I think this is a case of a so-called bottleneck, where you can't do
much.

> But it is a complex problem, and while OpenBSD might be late to the
> game, it also didn't move too fast in this regard and broke things,
> and this is worth something. I would not have liked some period of
> years when obsd was unusable because non-careful code was pushed into
> the kernel only because a benchmark says "new code faster".

I think it is very easy to create a mess and drop everything on the
floor. So right now the developers are going through the code just to
see where they can turn more of the "global" lock into fine-grained
locks? I guess you still need some locks here and there; you cannot
eliminate them all.

Thank you for your explanations and time.
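PS: coming back to my RAM buffer exercise above: as far as I
understand, the stack already does something in that spirit, in the
sense that packets sit in locked queues (interface queues, socket
buffers) until the next stage or the application takes them out. Below
is a toy userland version of such a hand-off queue, just to picture
the pattern; it is not OpenBSD code and all the names are invented.
One thread plays the "stack" side and puts packet numbers in, another
plays the "application" side and takes them out; the mutex protects
the indexes, and the condition variables let each side sleep instead
of spinning when the queue is full or empty.

#include <pthread.h>
#include <stdio.h>

#define QSIZE 8
#define NPKTS 32

static int q[QSIZE];
static int head, tail, count;
static pthread_mutex_t q_mtx = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  q_notfull  = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  q_notempty = PTHREAD_COND_INITIALIZER;

static void *
producer(void *arg)              /* the "stack" side */
{
    int i;

    (void)arg;
    for (i = 0; i < NPKTS; i++) {
        pthread_mutex_lock(&q_mtx);
        while (count == QSIZE)
            pthread_cond_wait(&q_notfull, &q_mtx);
        q[tail] = i;
        tail = (tail + 1) % QSIZE;
        count++;
        pthread_cond_signal(&q_notempty);
        pthread_mutex_unlock(&q_mtx);
    }
    return NULL;
}

static void *
consumer(void *arg)              /* the "application" side */
{
    int i, pkt;

    (void)arg;
    for (i = 0; i < NPKTS; i++) {
        pthread_mutex_lock(&q_mtx);
        while (count == 0)
            pthread_cond_wait(&q_notempty, &q_mtx);
        pkt = q[head];
        head = (head + 1) % QSIZE;
        count--;
        pthread_cond_signal(&q_notfull);
        pthread_mutex_unlock(&q_mtx);
        printf("consumed packet %d\n", pkt);
    }
    return NULL;
}

int
main(void)
{
    pthread_t p, c;

    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}

Each side only holds the lock long enough to put a packet in or take
one out, so otherwise the two can run on different cores at the same
time.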