I think you can preserve the Plan 9 model, get an efficient solution, and
do so without adding system calls. We did it on Blue Gene.

On Blue Gene, we had issues with using /dev/bintime because even doing a
read(fd, pointer, size) had a lot of jitter, due to operations such as
okaddr.  Overhead was even more critical for the global barrier network
(gib), which could do a full barrier across 64k nodes in 125ns.

Paper here:
https://www.researchgate.net/publication/265264528_Using_Currying_and_process-private_system_calls_to_break_the_one-microsecond_system_call_barrier

The model we used implemented currying of syscall arguments: write(fd,
pointer, size) became a new system call, with the fd and pointer verified;
and process private system calls, so that we were not polluting the global
system call table with a bunch of new system calls. Process private system
calls build on Plan 9's model of per-process resources; it worked very well
for us.

We added the support via the gib ctl file, so we did not need to
change system calls. In the example below, gib is the global barrier, and
gibctl is its ctl file.

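int cfd, gfd, scnum = 256;
char area[1], *cmd;

gfd = open("/dev/gib", ORDWR);
cfd = open("/dev/gib0ctl", OWRITE);
/* curry fd, pointer, and size into private system call scnum */
cmd = smprint("fastwrite %d %d 0x%p %d", scnum, gfd, area, sizeof(area));
write(cfd, cmd, strlen(cmd));
free(cmd);
close(cfd);
docall(scnum);	/* fast path: the arguments were verified once, above */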

We could keep calling docall(scnum), and most of the code in syscall for
checking was bypassed. A write or read system call then had all the
overhead of sysr1.
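
To make the currying idea concrete, here is a minimal user-space sketch in
portable C. It is not the Blue Gene kernel code: privtab, fastwrite_install,
and docall_model are hypothetical names, and the "checks" are stand-ins for
what the kernel would verify (okaddr and friends). The point is only the
shape: validate fd, pointer, and size once at install time, then the call
path does a table lookup and the write.

```c
#include <stddef.h>
#include <unistd.h>

/* Hypothetical model of a per-process private system call table. */
enum { NPRIVATE = 8, PRIVBASE = 256 };

typedef struct Curried Curried;
struct Curried {
	int	inuse;
	int	fd;		/* checked once, at install time */
	const void *buf;	/* checked once, at install time */
	size_t	n;
};

static Curried privtab[NPRIVATE];

/* Install a curried write: all argument checking happens here, once. */
int
fastwrite_install(int scnum, int fd, const void *buf, size_t n)
{
	int i = scnum - PRIVBASE;

	if(i < 0 || i >= NPRIVATE || fd < 0 || buf == NULL || n == 0)
		return -1;
	privtab[i] = (Curried){1, fd, buf, n};
	return 0;
}

/* Invoke the curried call: only the slot lookup, no argument checks. */
long
docall_model(int scnum)
{
	int i = scnum - PRIVBASE;

	if(i < 0 || i >= NPRIVATE || !privtab[i].inuse)
		return -1;
	return write(privtab[i].fd, privtab[i].buf, privtab[i].n);
}
```

In this model, fastwrite_install(256, fd, area, sizeof area) plays the role
of the ctl write above, and docall_model(256) plays the role of
docall(scnum).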

From the paper: "With the traditional write path, it took approximately
3,000 cycles per write. Since the BG/P uses 850 MHz PowerPC processors,
this means a normal write takes approximately 3.529 microseconds. However,
when using the private system calls, it only takes around 620 cycles to do
a write, or 0.729 microseconds."

This equaled the performance of an OS bypass solution, while not bypassing
the OS.
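
For anyone checking the arithmetic in that quote: at 850 MHz there are 850
cycles per microsecond, so the conversion is a single division. The helper
name below is mine, not something from the paper.

```c
/* Convert a cycle count to microseconds, given the clock rate in MHz. */
double
cycles_to_usec(double cycles, double mhz)
{
	return cycles / mhz;	/* mhz is cycles per microsecond */
}
```

cycles_to_usec(3000, 850) gives about 3.529 and cycles_to_usec(620, 850)
gives about 0.729, matching the quoted figures; the ratio of the two cycle
counts is roughly 4.8.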

The bigger factor was the lack of jitter for IO. All the checking in
syscall is done once, not on every call.

We measured every program that ran on Plan 9 over a period of days, over a
LOT of system calls, and it turned out that, for the most part, programs
call read and write with the same fd, the same address, and the same size
(typically well under a page), so locking that fd, address, and size down
is a pretty big win.
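
As an illustration of the kind of question we were asking of the trace (not
the tooling we actually used, which I am not reproducing here), this sketch
counts how many calls in a trace repeat an (fd, address, size) triple seen
earlier. The Call type and the repeats function are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct Call Call;
struct Call {
	int	fd;
	uintptr_t addr;
	size_t	n;
};

/* Count trace entries whose (fd, addr, n) matches some earlier entry. */
size_t
repeats(const Call *t, size_t len)
{
	size_t i, j, hits = 0;

	for(i = 1; i < len; i++)
		for(j = 0; j < i; j++)
			if(t[i].fd == t[j].fd && t[i].addr == t[j].addr
			&& t[i].n == t[j].n){
				hits++;
				break;
			}
	return hits;
}
```

A high ratio of repeats to total calls is what makes currying pay off: each
repeated triple is a call whose checks could have been done once.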

Had we been able to use this for making bintime efficient, no nsec() system
call would ever have been needed. I don't think your kread and kwrite are
needed either.

It's nice to avoid adding more system calls, because as nsec() shows, if
they're not right, we're still stuck with them forever.

I think process private system calls can provide what you want. The code is
still out there in the Blue Gene kernel for Plan 9. It's very small.


On Mon, Mar 10, 2025 at 8:17 AM Russ Cox <r...@swtch.com> wrote:

> Hi all,
>
> Cinap said out in the other thread that nsec had been added and then
> abandoned because it wasn't right. That turns out only to be half wrong -
> it's not true today but it probably should be true in the future. We do
> need a time-related special system call, but not that one.
>
> I just saw a Go program crash because it observed monotonic time move
> backward. That happened because on Plan 9, Go does not have easy access to
> monotonic time, only Unix time. And when Unix time moves backward (like
> timesync makes it do) then Go sees that as monotonic time moving backward.
> The ironic thing is that #c/bintime has all the info Go needs, but Go
> stopped using it.
>
> The nsec system call was added to avoid needing to keep #c/bintime open in
> all programs, avoid the problems of it accidentally using a standard fd (0
> 1 2) etc. But nsec is too specialized. bintime returns more than just Unix
> nanoseconds. The right answer would have been to add a readbintime(p, n)
> system call that acts like pread(/dev/bintime, p, n, 0), dispatching to the
> kernel's readbintime function. I suggest we actually do that, which would
> make monotonic time access work right.
>
> While we are avoiding pre-opened file descriptors, the other thing modern
> operating systems have come to realize is that /dev/random is important
> enough to be able to access without a file descriptor. It would be good to
> add a readcrypto(p, n) system call at the same time.
>
> Perhaps there should not be two new system calls. Perhaps it should be one
> new readspecial(id, p, n) system call.
>
> Or perhaps there should be no new system calls, and instead pread should
> accept a few distinguished negative file descriptors. Obviously fd=-1 has
> to keep returning an error, but perhaps we should define that -2 is
> #c/bintime and -3 is #c/random. Or if -2 is too close to -1, we could use
> -1000 and -1001.
>
> Personally I think the negative numbers are a bit too special, and I'd be
> inclined to add two new system calls kread(kfd, data, n, off) and
> kwrite(kfd, data, n, off), which are like pread and pwrite except that they
> operate on "kernel file descriptors", which are small integers that are
> always open and refer to specific kernel resources. The initial set of
> kernel file descriptors are
>
>     0 #c/bintime
>     1 #c/random
>
> This set could be extended over time; use of an unrecognized kfd would
> return an error. This approach solves the "keep special fds open" problem
> directly, without abandoning Plan 9's "everything is a file" quite as much
> as nsec(2) does or readbintime(2) or readcrypto(2) would.
>
> Thoughts?
>
> Best,
> Russ

------------------------------------------
9fans: 9fans
Permalink: 
https://9fans.topicbox.com/groups/9fans/T59810df4fe34a033-M8b2e41af7cb9d42af44daa90
Delivery options: https://9fans.topicbox.com/groups/9fans/subscription
