I thought that syscons always had correct buffering. Actually, it
uses a hybrid scheme where, at least in text mode, the initial i/o is
itty-bitty 1 character+attribute at a time (16-bit i/o), but scrolling
and screen refresh is done by bcopy_io(), bcopy_fromio() and
bcopy_toio() and a couple of other functions (bzero_io(), fill*())
from/to a properly cached buffer in normal memory. It used to use
only bcopy() and a couple of others (bzero(), fill*()), so it
automatically did 64-bit i/o's on 64-bit systems, except for fillw*()
which was intentionally 16 bits for compatibility (but it didn't use
bcopy() which is needed for even more compatibility). It is unclear
which old systems break with frame buffer i/o's larger (or smaller)
than 16 bits. I never had any (x86) hardware that didn't work with any
size. The video card might be 16-bit only, but then it should just
tell the CPU this so that the CPU reduces to 16 bits using standard
x86 mechanisms. Video cards have been PCI or better for about 20
years. PCI should support precisely 32-bits, but 64-bit frame buffer
accesses to PCI and AGP video cards always worked for me.
bcopy*io() is more technically correct, but is very badly implemented
and much slower than bcopy() on most systems. Its misimplementation
includes not even using bus-space on x86. All bcopy*io() functions
use copyw() on x86, and copyw() is just a dumb 16-bit memcpy() written
in C. Writing it in C doesn't lose anything when it is used for a
slow i/o memory, but doing 16-bit i/o's does. And doing 16-bit i/o's
doesn't even give compatibility, since bzero_io() is just bzero() on
x86, so it always does wider i/o's. syscons has always used fillw*()
and never plain fill() since it doesn't want the corresponding 32-bit
writes that might be given by fill(). fill() actually does 8-bit
writes. fb also uses the badly named and implemented filll_io().
This doesn't actually support longs, but only u_int32_t. fill_io()
is at least ifdefed on ${ARCH}, so its access size is not completely
hard-coded. On arm and mips, all the ifdefed "io" functions except
fill_io use plain memcpy() or memset() so they get a maximum access
size and minimum hardware compatibility. fillw() is 16 bits on these
arches since the access size is hard-coded in the API (and conversion
to memset() is not done).
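To put numbers on the access-size point, here is a minimal user-space sketch. It is not the FreeBSD code: copy16() is a hypothetical stand-in for a copyw()-style loop that moves one 16-bit cell per access, and accesses() is a made-up helper that just counts bus accesses for a given width. In ordinary memory the result is the same as memcpy(); the difference only matters when the destination is slow i/o memory, where the number of accesses dominates.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical stand-in for a copyw()-style routine: one 16-bit
 * access per character+attribute cell.  len is in bytes and is
 * assumed to be even.
 */
static void
copy16(const volatile uint16_t *src, volatile uint16_t *dst, size_t len)
{
	size_t i;

	for (i = 0; i < len / 2; i++)
		dst[i] = src[i];
}

/* Number of bus accesses needed to move len bytes at a given width. */
static size_t
accesses(size_t len, size_t width)
{
	return ((len + width - 1) / width);
}
```

An 80x25 text screen is 4000 bytes, so accesses(4000, 2) is 2000 while accesses(4000, 8) is 500: the 16-bit loop issues 4 times as many accesses as a 64-bit copy would.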
Pessimizations in syscons have made it about twice as slow as in
FreeBSD-5. This is probably mostly due to switching from bcopy() to
copyw(). There
is a lot of bloat in upper layers, but with 2GHz CPUs it would take a
factor of about 10 pessimizations there to be comparable with i/o
pessimizations.
A correctly-implemented console driver assembles an image of the frame
buffer in fast memory and copies from there to the frame buffer in
large chunks. It is tricky to keep track of changed regions so as to
not copy unchanged regions. Copying everything at a refresh rate of
not much slower than 20 Hz works well. 200 Hz would be better for
animation, but that is rarely needed. The bandwidth for 80x25 text mode
at 20 Hz is 80 kB/second. That was easy in 1982. I aimed for 100 Hz
refresh on 2 MHz
6809 systems in 1987. PC hardware at 5 MHz was about twice as slow,
especially for frame buffers. But it could do 80 kB/second. The
bandwidth for 80x25 8x16 256 color bitmapped mode is 640kB/second.
This was difficult in 1982, but very easy now. Yet the WindowsXP
safe mode with command prompt console is about as slow at scrolling
as a 1982 system in graphics mode. It uses similar techniques to
implement the slowness:
- a large bitmapped screen. 640x200 8 colors in 1982. Quite
a bit larger (something like 1024x768 256 colors) in 20XX.
- write to the screen very slowly. Use 8-bit writes with i/o artifacts
if possible. The 1982 system had to do 8-bit writes to 3 color planes.
256-color mode is simpler than most. Writes can also be done very
slowly by using another mode and misaligning text so that every
character written needs merging with pixels from adjacent characters.
- do scrolling in software by copying 1 pixel at a time, using
read-modify-write

I only tested this on 5-10 year old hardware, with a 1920x1080 screen
but not all of it used for the console window, and with a laptop
1024x768 screen. A good way to be slow, one that has been portable to
PC systems since 1982, is to use the BIOS for video. The console was
about twice as fast on the laptop. This might be due to a combination
of fewer pixels and a less well pessimized BIOS.
Some old screen benchmarks. The benchmark is basically to write lines
of the screen width and scroll. I stopped updating this often about 15
years ago when frame buffers and CPUs became fast enough. But it appears
that software bloat and design errors have caught up.
% ISA ET4000: 2.4MB/sec read, 5.9MB/sec write
% VLB ET4000/W32i: 6.8MB/sec read, 25.5MB/sec write
% PCI S3/868: 3.5MB/sec read, 23.1MB/sec write
% PCI S3/Virge: 4.1MB/sec read, 40.0MB/sec write
% PCI S3/Savage: 3.3MB/sec read, 25.8MB/sec write
% PCI Xpert: 5.3MB/sec read, 21.8MB/sec write
% PCI R9200SE: 5.8MB/sec read, 60.2MB/sec write (but 120MB/sec fpu, 250/sec sfpu)
% -o means stty flag -opost
% % No-scroll:
Scrolling is avoided by repositioning the cursor after every screenful.
% % machine video           O/S             where        real user   sys speed
% --------- --------------- --------------- ----------- ----- ---- ----- -----
% A/2223    PCI R9200SE     FreeBSD-5.2m    onscreen-o   .026 0.00  .026  76.9
% A/2223    PCI R9200SE     FreeBSD-5.2m    offscreen-o  .026 0.00  .026  76.9
% A/2223    PCI R9200SE     FreeBSD-5.2m    onscreen     .031 0.00  .031  64.5
% A/2223    PCI R9200SE     FreeBSD-5.2m    offscreen    .031 0.00  .031  64.5
An 11 year old system.
'onscreen' means output to an active vty, 'offscreen' to an inactive vty.
The mere existence of vtys requires full buffering to fast memory for
inactive vtys, since there is no hardware frame buffer memory to write
to for the inactive vtys. You have to buffer the writes in a form that
can be replayed when an inactive vty becomes active, and converting
immediately to the final form is a good method (it does take more memory
and limits history to a raw form). 'offscreen' is potentially much
faster, but in most cases it is only slightly faster, due to delayed
refreshes for 'onscreen' and relatively fast frame buffer memory.
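The buffering described above can be sketched roughly as follows. This is only an illustration under my own assumptions, not syscons or Minix code: struct vty, vty_putc() and vty_flush() are hypothetical names, and the dirty tracking is the crudest kind (a single range of rows). Writes always go to the image in fast memory; only the active vty is flushed, in one chunk, at the refresh rate.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define COLS	80
#define ROWS	25

struct vty {
	uint16_t shadow[ROWS * COLS];	/* screen image in fast memory */
	int dirty_lo, dirty_hi;		/* dirty row range; lo > hi means clean */
};

static void
vty_init(struct vty *v)
{
	memset(v->shadow, 0, sizeof(v->shadow));
	v->dirty_lo = ROWS;
	v->dirty_hi = -1;
}

/* Write one cell to the shadow only and remember which row changed. */
static void
vty_putc(struct vty *v, int row, int col, uint16_t cell)
{
	v->shadow[row * COLS + col] = cell;
	if (row < v->dirty_lo)
		v->dirty_lo = row;
	if (row > v->dirty_hi)
		v->dirty_hi = row;
}

/*
 * Called at the refresh rate for the active vty only.  Copies the
 * dirty rows to the frame buffer (fb) in a single chunk and returns
 * the number of bytes copied.
 */
static size_t
vty_flush(struct vty *v, uint16_t *fb)
{
	size_t off, len;

	if (v->dirty_lo > v->dirty_hi)
		return (0);
	off = (size_t)v->dirty_lo * COLS;
	len = (size_t)(v->dirty_hi - v->dirty_lo + 1) * COLS *
	    sizeof(uint16_t);
	memcpy(fb + off, v->shadow + off, len);
	v->dirty_lo = ROWS;
	v->dirty_hi = -1;
	return (len);
}
```

An inactive vty never calls vty_flush(), so its output costs only the writes to fast memory. Switching vtys amounts to marking all rows dirty and flushing once.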
-opost is tested separately because the Linux console driver was
amazingly slow without it. This shows that it is possible for the
software bloat to be so large that it dominates hardware slowness.
FreeBSD also has lots of bloat in the tty and syscons layers near opost,
but it is in the noise compared with the old Linux console driver.
I forget the units for these measurements, except that the speed column
gives a bandwidth in MB/sec. I don't remember if this is for write(2)
bandwidth or is related to frame buffer bandwidth. Interpret them as
relative.
On a system similar to the above, syscons scrolls at 50000 lines/sec.
Non-virtually, this would require a frame buffer bandwidth of 200MB/sec,
which is several times faster than possible. Since syscons only does
a direct update for bytes written, it needs only about 1/25 of this
bandwidth or 8MB/sec. This is not quite in the noise compared with
a frame buffer bandwidth of 60.2MB/sec.
% K6/233    PCI S3/Virge    minix-1.6.25++  offscreen     0.2 0.00  0.12  16.0
% K6/233    PCI S3/Virge    minix-1.6.25++  onscreen      0.2 0.00  0.12  16.0
The Minix driver from 1990 (rewritten to support virtual consoles and to
be efficient) is faster than syscons of course. It is smarter about
buffering, so the onscreen case goes at almost the same speed as the
offscreen case.
% K6/233    PCI S3/Virge    FreeBSD-current onscreen-o   0.23 0.00  0.23  8.85
% K6/233    PCI S3/Virge    FreeBSD-current offscreen-o  0.23 0.00  0.23  8.85
syscons is just slightly slower for the offscreen case. -current was
only current in ~2004.
% K6/233    PCI S3/Virge    FreeBSD-current onscreen     0.34 0.00  0.34  5.83
% K6/233    PCI S3/Virge    FreeBSD-current offscreen    0.34 0.00  0.34  5.81
But in the onscreen case, syscons is more than 50% slower, due to less
virtualization. This slowness became smaller with faster frame buffers,
but is still noticeable in benchmarks with the S3/Virge's write bandwidth
of 40.0MB/sec.
% P5/133    PCI S3/868      FreeBSD-current onscreen-o   0.39 0.00  0.39  5.10
% P5/133    PCI S3/868      FreeBSD-current offscreen-o  0.40 0.00  0.40  5.00
% P5/133    PCI S3/868      FreeBSD-current onscreen     0.51 0.00  0.50  3.92
% P5/133    PCI S3/868      FreeBSD-current offscreen    0.51 0.00  0.51  3.92
% K6/233    PCI S3/Virge    linux-2.1.63    offscreen-o  0.97 0.00  0.97  2.06
% K6/233    PCI S3/Virge    linux-2.1.63    onscreen-o   1.03 0.00  1.03  1.93
% K6/233    PCI S3/Virge    linux-2.1.63    offscreen    1.18 0.00  1.18  1.69
% DX2/66    VLB ET4000/W32i FreeBSD-current offscreen-o  1.18 0.00  1.16  1.69
% DX2/66    VLB ET4000/W32i FreeBSD-current onscreen-o   1.27 0.02  1.23  1.57
% K6/233    PCI S3/Virge    linux-2.1.63    onscreen     1.38 0.00  1.38  1.45
% 486/33    ISA ET4000      minix-1.6.25++  offscreen       2 0.01  1.45  1.37
% 486/33    ISA ET4000      minix-1.6.25++  onscreen        2 0.01  1.60  1.24
% DX2/66    VLB ET4000/W32i FreeBSD-current offscreen    1.60 0.00  1.59  1.25
% DX2/66    VLB ET4000/W32i FreeBSD-current onscreen     1.70 0.01  1.66  1.18
% 486/33    ISA ET4000      FreeBSD-current offscreen-o  2.30 0.01  2.28  0.87
% 486/33    ISA ET4000      FreeBSD-current onscreen-o   2.39 0.02  2.32  0.84
% 486/33    ISA ET4000      FreeBSD-current offscreen    3.15 0.03  3.10  0.63
% 486/33    ISA ET4000      FreeBSD-current onscreen     3.27 0.00  3.21  0.61
% DX2/66    VLB ET4000/W32i linux-1.2.13    offscreen-o  3.63 0.01  3.62  0.15
% DX2/66    VLB ET4000/W32i linux-1.2.13    onscreen-o   3.65 0.01  3.63  0.55
% DX2/66    VLB ET4000/W32i linux-1.2.13    offscreen   12.48 0.01 12.47  0.16
% 486/33    ISA ET4000      linux-1.1.36    offscreen   20.80 0.00 20.80  0.10
% DX2/66    VLB ET4000/W32i linux-1.2.13    onscreen    26.98 0.01 26.95  0.07
% 486/33    ISA ET4000      linux-1.1.36    onscreen    38.34 0.02 38.38  0.05
The speedup from the worst case (old Linux on old hardware) to the best
case (old Minix on new hardware) is a factor of 38.34/0.026 = 1475.
Hardware speeds only increased by a factor of about 2223/33 = 67. Minix
was only 1.5 times faster than syscons and 10-20 times faster than Linux
on old hardware.
% % Scroll:
% % machine video           O/S             where        real user   sys speed
% --------- --------------- --------------- ----------- ----- ---- ----- -----
% A/2223    PCI R9200SE     FreeBSD-5.2m    onscreen-o   .047 0.00  .047  42.6
% A/2223    PCI R9200SE     FreeBSD-5.2m    offscreen-o  .047 0.00  .047  42.6
% A/2223    PCI R9200SE     FreeBSD-5.2m    onscreen     .051 0.00  .051  39.2
% A/2223    PCI R9200SE     FreeBSD-5.2m    offscreen    .051 0.00  .051  39.2
% K6/233    PCI S3/Virge    minix-1.6.25++  offscreen     0.2 0.00  0.14  14.0
% K6/233    PCI S3/Virge    minix-1.6.25++  onscreen      0.2 0.00  0.14  14.0
% K6/233    PCI S3/Virge    FreeBSD-current onscreen-o   0.36 0.00  0.36  5.54
% K6/233    PCI S3/Virge    FreeBSD-current offscreen-o  0.40 0.00  0.40  5.01
% K6/233    PCI S3/Virge    FreeBSD-current onscreen     0.47 0.00  0.47  4.22
% K6/233    PCI S3/Virge    FreeBSD-current offscreen    0.51 0.00  0.51  3.92
Scrolling makes no difference for Minix due to the better virtualization.
It slows down syscons by about 50%. Strangely, the onscreen case is now
faster?!
% P5/133    PCI S3/868      FreeBSD-current onscreen-o   1.24 0.00  1.23  1.61
% P5/133    PCI S3/868      FreeBSD-current offscreen-o  1.28 0.00  1.27  1.56
% P5/133    PCI S3/868      FreeBSD-current onscreen     1.35 0.00  1.34  1.48
% P5/133    PCI S3/868      FreeBSD-current offscreen    1.39 0.00  1.38  1.44
% K6/233    PCI S3/Virge    linux-2.1.63    onscreen-o   1.49 0.00  1.49  1.34
% 486/33    ISA ET4000      minix-1.6.25++  offscreen       2 0.00  1.70  1.18
% 486/33    ISA ET4000      minix-1.6.25++  onscreen        2 0.00  1.81  1.10
% K6/233    PCI S3/Virge    linux-2.1.63    onscreen     1.85 0.00  1.85  1.08
% K6/233    PCI S3/Virge    linux-2.1.63    offscreen-o  2.88 0.00  2.88  0.69
% K6/233    PCI S3/Virge    linux-2.1.63    offscreen    3.10 0.00  3.10  0.65
% DX2/66    VLB ET4000/W32i FreeBSD-current offscreen-o  3.39 0.02  3.36  0.59
% DX2/66    VLB ET4000/W32i FreeBSD-current onscreen-o   3.67 0.02  3.63  0.54
% DX2/66    VLB ET4000/W32i FreeBSD-current offscreen    3.82 0.00  3.81  0.52
% DX2/66    VLB ET4000/W32i FreeBSD-current onscreen     4.14 0.03  4.06  0.48
% DX2/66    VLB ET4000/W32i linux-1.2.13    onscreen-o   4.34 0.01  4.32  0.46
% 486/33    ISA ET4000      FreeBSD-current offscreen-o  5.54 0.03  5.48  0.36
% 486/33    ISA ET4000      FreeBSD-current onscreen-o   5.73 0.00  5.61  0.35
% 486/33    ISA ET4000      FreeBSD-current offscreen    6.41 0.03  6.34  0.31
% 486/33    ISA ET4000      FreeBSD-current onscreen     6.62 0.01  6.45  0.30
The old systems didn't have the CPU or frame buffer bandwidth to scroll
at 50000 lines/sec. Rescaling 50000 by this 6.62 divided by the above
0.026 gives only 196 lines/sec. That was usable, but since you can see
the scroll move it is not very good. Rescaling Minix's 2.0 gives 650
lines/sec, or a full screen refresh rate of 26 Hz. You can probably see
the scroll flicker but not move at this rate. Of course, the
implementation does delayed refresh to reach this rate, so most of the
scrolling steps are virtual and you can only see the screen flicker for
other reasons. syscons' scrolling is also virtual.
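The rescaling above is just a ratio of benchmark times. A small sketch of the arithmetic, with made-up helper names (rescale() and refresh_hz() are mine, not from any driver), taking the 50000 lines/sec and .026 figures as the baseline:

```c
/*
 * Scale a known scroll rate (lines/sec) measured on a system whose
 * benchmark took base_time to a system whose benchmark took
 * other_time.  Both times are in the same (relative) units.
 */
static double
rescale(double base_rate, double base_time, double other_time)
{
	return (base_rate * base_time / other_time);
}

/* Full-screen refresh rate implied by a scroll rate, for a given height. */
static double
refresh_hz(double lines_per_sec, int rows)
{
	return (lines_per_sec / rows);
}
```

rescale(50000, 0.026, 6.62) gives about 196 lines/sec for syscons on the 486/33, rescale(50000, 0.026, 2.0) gives 650 lines/sec for Minix, and refresh_hz(650, 25) gives 26 Hz.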
% DX2/66    VLB ET4000/W32i linux-1.2.13    offscreen-o 13.48 0.01 13.47  0.15
% DX2/66    VLB ET4000/W32i linux-1.2.13    offscreen   22.60 0.01 22.42  0.09
% 486/33    ISA ET4000      linux-1.1.36    offscreen   23.56 0.03 23.60  0.08
% DX2/66    VLB ET4000/W32i linux-1.2.13    onscreen    27.73 0.01 27.72  0.08
% 486/33    ISA ET4000      linux-1.1.36    onscreen    40.26 0.00 40.27  0.05
Rescaling 50000 by this 40.26 divided by the above 0.026 gives about 32
lines/sec. That is only a bit better than 1982 pixel mode quality. But
this is for text mode.
Bruce