Currently it is not possible to use Unicode codepoints > 0xFF on the console,
because our UTF-8 decoding logic is badly broken.
The code in question is wsemul_getchar() in wsemul_subr.c.
The problem is that we calculate the number of bytes in a multi-byte
sequence by just looking at the high bits in turn:
	if (frag & 0x20) {
		frag &= ~0x20;
		mbleft++;
	}
	if (frag & 0x10) {
		frag &= ~0x10;
		mbleft++;
	}
	if (frag & 0x08) {
		frag &= ~0x08;
		mbleft++;
	}
	if (frag & 0x04) {
		frag &= ~0x04;
		mbleft++;
	}
This is wrong, for several reasons.
Firstly, for about 20 years now, the maximum number of bytes in a UTF-8
sequence has been four, so we shouldn't be checking 0x08 and 0x04 (or rather,
we should only check that 0x08 is 0 when 0x10 is 1, which indicates a
four-byte sequence).
Secondly, the check for 0x10 should only be performed when 0x20 is also set,
because in the first byte of a two-byte sequence (110xxxxx) the 0x10 bit is
payload rather than part of the length prefix.
By chance, the current logic successfully decodes the UTF-8 encodings of
Unicode codepoints 0x80 - 0xFF, because their first byte is always 0xC2 or
0xC3, which leaves bits 2-4 of that byte clear.
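To make the difference concrete, here is a small standalone sketch that
models only the bit tests quoted above. The function names old_mbleft() and
new_mbleft() are made up for illustration, and the starting value of 1 is an
assumption (the lead byte is taken to have already been recognised as
11xxxxxx); this is not the actual kernel code.

```c
/*
 * Hypothetical model of the continuation-byte count, before and after
 * the fix.  Assumption: the caller has already checked that the lead
 * byte has its top two bits set, so one continuation byte is expected.
 */

/* Old behaviour: bits 0x20, 0x10, 0x08 and 0x04 are tested independently. */
int
old_mbleft(unsigned char frag)
{
	int mbleft = 1;

	if (frag & 0x20)
		mbleft++;
	if (frag & 0x10)
		mbleft++;
	if (frag & 0x08)
		mbleft++;
	if (frag & 0x04)
		mbleft++;
	return mbleft;
}

/* New behaviour: 0x10 only counts when 0x20 is set; 0x08, 0x04 are data. */
int
new_mbleft(unsigned char frag)
{
	int mbleft = 1;

	if (frag & 0x20) {
		mbleft++;
		if (frag & 0x10)
			mbleft++;
	}
	return mbleft;
}
```

For U+00FF (C3 BF) both versions expect one continuation byte. For U+0100
(C4 80) the old logic expects two and the new logic expects one, and for
U+4E2D (E4 B8 AD) the old logic expects three where the new logic expects two.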
However, to use console fonts with more than 256 characters we need this
fixed. I created a font with an extra glyph at position 0x100, and was able to
use it once I had applied the attached patch.
The UTF-8 decoder still needs more work to reject invalid sequences such as
overlong encodings and encoded UTF-16 surrogates.
But it would be nice to get at least this fix in as it is trivial and allows
further experimentation with UTF-8 on the console using fonts with more than
256 glyphs.
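As a rough idea of what rejecting the invalid sequences mentioned above could
look like, here is a minimal sketch of a check on the fully decoded value.
The function utf8_cp_ok() and its parameters are hypothetical and not part of
wsemul_subr.c; it assumes the decoder knows how many bytes the sequence
occupied.

```c
#include <stdint.h>

/*
 * Hypothetical post-decode check.  cp is the decoded codepoint and len
 * is the total number of bytes in the sequence (1 to 4).
 * Returns 1 if the value is acceptable, 0 otherwise.
 */
int
utf8_cp_ok(uint32_t cp, int len)
{
	/* Smallest codepoint that legitimately needs len bytes. */
	static const uint32_t min_cp[] = { 0, 0x00, 0x80, 0x800, 0x10000 };

	if (len < 1 || len > 4)
		return 0;
	if (cp < min_cp[len])
		return 0;	/* overlong encoding */
	if (cp >= 0xD800 && cp <= 0xDFFF)
		return 0;	/* UTF-16 surrogate */
	if (cp > 0x10FFFF)
		return 0;	/* beyond the Unicode range */
	return 1;
}
```

With this, the overlong form C0 AF (value 0x2F in two bytes) and the
surrogate ED A0 80 (value 0xD800) would both be rejected, while C4 80
(value 0x100) would be accepted.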
I'll do a more detailed write-up about this at some point, but I've already
had questions off-list about "why OpenBSD doesn't support more than 256
characters in a font" since I started posting the console patches, so I
thought it would be good to get this patch out there.
--- wsemul_subr.c.dist Fri Oct 18 19:06:41 2013
+++ wsemul_subr.c Sat Feb 25 13:58:00 2023
@@ -125,20 +125,11 @@
if (frag & 0x20) {
frag &= ~0x20;
mbleft++;
+ if (frag & 0x10) {
+ frag &= ~0x10;
+ mbleft++;
+ }
}
- if (frag & 0x10) {
- frag &= ~0x10;
- mbleft++;
- }
- if (frag & 0x08) {
- frag &= ~0x08;
- mbleft++;
- }
- if (frag & 0x04) {
- frag &= ~0x04;
- mbleft++;
- }
-
tmpchar = frag;
}
}