On 05/10/2021, ropers <rop...@gmail.com> wrote:
> This does relate to a question I've been thinking about for a while,
> so even if actually offering diffs for that is still way above my pay
> grade, I will offer these thoughts:
>
> * Of ASCII's 128 characters, only 95 are actually printable (ASCII
> sticks 2 thru 7 minus 0x7F DEL).[0]
> * In principle, the console is capable of supporting 256 glyphs.
> * With traditional Extended ASCII (EASCII) character sets, more than
> 95 characters were (still are) printable, but code assuming the use of
> ISO 8859-1 is deprecated and no longer portable in this age of UTF-8,
> and for EASCII sticks 8 thru F, there no longer is a direct
> correspondence between code points and code units at all.
> * Even if framebuffer console drivers could hypothetically be altered
> to allow the use of more than 256 glyphs, I completely agree with Ingo
> that that would be a fairly terrible idea for various reasons. While
> the 256-glyphs limitation does stem from VGA console drivers
> permitting no more than 256 text mode glyphs (or 512 with hacks), it
> would be best to not totally break framebuffer and vga console
> compatibility, but to stay within those limits.
> * With "extremely minimalistic UTF-8 support", up to 161 "spots" might
> be available.
> * There are 1,112,064 legal Unicode character code points (0x11 *
> 0x10000 - 0x800, i.e. seventeen 65,536-character planes minus the
> 2,048 code points from U+D800 thru U+DFFF that are reserved for UTF-16
> surrogates). Of those, 137,468 are private use, and 66 are
> non-characters. If we also subtract the 95 printable ASCII
> characters, that leaves 974,435 characters that might compete for
> those 161 spots.
> * There is an extremely strong argument for accommodating all
> characters from ISO 8859-1 in any future minimalistic UTF-8 console
> support. The non-breaking space and soft hyphen could use the same
> glyphs as space and hyphen-minus, respectively.
> This means that to
> maintain maximum backwards compatibility and UTF-8
> forward-portability, 94 of those 161 spots would have to be taken,
> leaving 67.
> * There might also be a strong argument for accommodating all the
> characters from ISO 8859-15 (so an additional 8) and Windows-1252,[1]
> which despite no Unix pedigree is a common superset of ISO 8859-1,
> with EASCII sticks A thru F being identical to ISO 8859-1. ISO
> 8859-15 differs from ISO 8859-1 in that it includes 8 characters in
> sticks A/B that Windows-1252 encodes in sticks 8/9. However, with
> UTF-8, code units and code points no longer match outside of sticks
> 0-7, so UTF-8 implementers of ISO 8859-1 and Windows-1252 backwards
> compatibility get ISO 8859-15 support for free. Besides those 8,
> Windows-1252 support would consume an additional 19 characters, so
> we'd have to subtract 27 from those 27 remaining spots, leaving 40.
s/27 remaining/67 remaining

> * 32 of those spots are from the C0 control codes from ASCII sticks
> 0/1. While Bemer et al. did originally propose alternatively
> printable glyphs for those normally unprintable characters, their
> glyphs were never commonly used. If "maximum printability" is a
> criterion, Unicode does define so-called "Control Pictures" for them
> (U+2400 thru U+241F).[2] It conceivably could be useful to have e.g.
> a console-based hex editor render something printable for most code
> units, however attempts at Control Picture inclusion would bump
> against the technical limitation that the Control Pictures glyphs are
> already barely legible in X11/xterm: So could actually useful Control
> Pictures glyphs even be defined if one has just 8x16, 8x14, 8x10 or
> 8x8 pixels to play with, as may be the case on the console? It seems
> doubtful. Perhaps those sparse spots and precious pixels are better
> spent on something else, like Cyrillic for example.
> * The once-common DOS code page 437 has 31 alternatively printable
> glyphs for sticks 0/1. Of these, only the bullet point, section sign
> and paragraph mark (pilcrow) can be found in the ISO 8859/Windows-1252
> family. There is no compelling reason --like what's mentioned in
> footnote [1]-- that could motivate the inclusion of its stick 0/1
> glyphs, DOS having largely gone the way of the dodo. Also, full CP437
> support would require many more glyphs, support for just this subset

s/support for/so support for

> of that old code page, never common in Unix-land, would seem wasteful.
> That still leaves 40 spots that could potentially be used.
> * The question is, which of the 974,435 candidates deserve one of
> those 40 spots. With a look at a relevant map[3], Arabic, Cyrillic,
> and Indic abugidas might have particularly strong claims.
> Arabic has
> 28 letters, but many contextual variants (though no case), Cyrillic,
> or more specifically the Russian alphabet has 33 letters and it does
> have case, so 40 spots might limit any support to UPPER CASE ONLY, or
> should I say ЦРРЕЯ СА5Е ОИГУ. I do not feel I know enough about Indic
> abugidas to say something intelligent.
> * The question of what subset of Unicode to settle on for minimalistic
> 256-glyphs-only UTF-8 support might be bigger than OpenBSD. Other
> Unix-like OSes might ask themselves the same question. Is this
> something that ought to be standardised across Unix-land or something
> OpenBSD would want to decide on its own?
> * I mentioned "512 with hacks" above, but I do not know enough if it
> could be viable, clean and VGA-compatible to blow past that 256
> boundary. If yes, then an additional 256 spots might comfortably
> allow for the inclusion of many more of the above.
> * Either way, even if no code is created at this time, just having a
> roadmap and knowing which glyphs ought to make the cut might be
> desirable. It would also be possible to already make the font(s) once
> that is known. Code that actually uses such a font to implement
> minimalistic UTF-8 support (for the console) need not arrive at the
> same time.
> * On the other hand, if extending our minimum character set to cover
> Windows-1252 and ISO 8859-15 and especially deciding upon the use of
> the last 40 spots cannot be settled yet, then it might be fine to
> leave that for later. The existing ISO 8859-1 fonts could actually be
> useable by a minimalistic UTF-8 support implementation, if developed.
> Again, once such an implementation has code points properly divorced
> from code units, it could absolutely source its glyphs from those
> fonts.
> That would only leave the small issue of UTF-8 compliance by
> everything else in base and ports...
>
> I hope that was useful and worth the verbiage.
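For anyone who wants to re-check the glyph budget in the quoted list
(with the 67 correction applied), here is a rough Python sketch. It is
only a sanity check, not a proposal for how anything should be
implemented. It assumes that Python's stock "latin_1", "iso8859_15" and
"cp1252" codecs reflect the code pages meant above; the helper name
decodable() and all variable names are made up for this illustration.

```python
def decodable(codec, lo, hi):
    """Return the set of characters that bytes lo..hi (inclusive) decode to."""
    chars = set()
    for b in range(lo, hi + 1):
        try:
            chars.add(bytes([b]).decode(codec))
        except UnicodeDecodeError:
            pass  # position is unassigned in this code page
    return chars

# Character repertoires, taken from Python's codecs (an assumption of this sketch).
ascii_printable = decodable("ascii", 0x20, 0x7E)      # sticks 2 thru 7 minus DEL
latin1_high = decodable("latin_1", 0xA0, 0xFF)        # ISO 8859-1 sticks A thru F
latin9_high = decodable("iso8859_15", 0xA0, 0xFF)     # ISO 8859-15 sticks A thru F
cp1252_c1 = decodable("cp1252", 0x80, 0x9F)           # Windows-1252 sticks 8/9

# NBSP and soft hyphen reuse the space and hyphen-minus glyphs.
latin1_glyphs = len(latin1_high) - 2
# Characters ISO 8859-15 has that ISO 8859-1 lacks.
latin9_only = latin9_high - latin1_high
# Windows-1252 characters beyond those.
cp1252_other = cp1252_c1 - latin9_only

spots = 256 - len(ascii_printable)            # slots left after printable ASCII
after_latin1 = spots - latin1_glyphs          # after ISO 8859-1
after_cp1252 = after_latin1 - len(cp1252_c1)  # after Windows-1252 / ISO 8859-15

# Unicode side of the argument.
code_points = 0x11 * 0x10000 - 0x800                       # planes minus surrogates
private_use = (0xF8FF - 0xE000 + 1) + 2 * (0xFFFFD - 0xF0000 + 1)
noncharacters = (0xFDEF - 0xFDD0 + 1) + 2 * 17
candidates = code_points - private_use - noncharacters - len(ascii_printable)

print(spots, after_latin1, after_cp1252)
print(len(latin9_only), len(cp1252_other), candidates)
```

The repertoire sizes come out of the codecs rather than being typed in,
so the 8 + 19 = 27 split is checked and not just restated.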
>
> Thanks for your time,
> Ian
>
> (Ian Ropers)
>
> Footnotes:
> [0] Yes, they're properly called sticks. 8 sticks of 16 characters in
> ASCII; 16 sticks of 16 characters in EASCII. See Bob Bemer's Inside
> ASCII.
> [1] Per enwp.org/CP1252, Windows-1252 text mislabelled as ISO-8859-1
> is still very common (online), and "[m]ost modern web browsers and
> e-mail clients treat the media type charset ISO-8859-1 as Windows-1252
> to accommodate such mislabeling. This is now standard behavior in the
> HTML5 specification, which requires that documents advertised as
> ISO-8859-1 actually be parsed with the Windows-1252 encoding."
> [2] https://en.wikipedia.org/wiki/Control_Pictures
> [3] https://en.wikipedia.org/wiki/File:Writing_systems_worldwide.png
>
>
> On 05/10/2021, Ingo Schwarze <schwa...@usta.de> wrote:
>> Hi Slava,
>>
>> Slava Voronzoff wrote on Tue, Oct 05, 2021 at 03:01:26PM +0300:
>>
>>> I'm working right now on adding cyrillic to Spleen font. How can I later
>>> add it to OpenBSD kernel and ports? Pull request to main font on github
>>> (Hi, Frederic) or patch here?
>>
>> You cannot add it to the kernel because the kernel does not support
>> UTF-8, but only US-ASCII, and US-ASCII contains no code points for
>> cyrillic letters.
>>
>> Full UTF-8 support is definitely not wanted in the kernel. Adding
>> extremely minimalistic UTF-8 support to the kernel is not completely
>> out of the question, but some developers are likely to feel sceptic even
>> about that. Consequently, trying to pursue a project of adding anything
>> related to UTF-8 to the kernel is likely to end in frustration if the
>> person trying that does not have a significant amount of experience with
>> getting OpenBSD kernel patches committed.
>>
>> I'm sorry that i know absolutely nothing about fonts in ports, maybe
>> someone else can answer that part of the question.
>>
>> Yours,
>> Ingo
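To make the "code points divorced from code units" idea from the thread
a bit more concrete, here is a userland Python sketch. It is emphatically
not kernel code and not how wscons or any OpenBSD driver works. It
assumes a hypothetical 256-slot font laid out like ISO 8859-1, and the
names FALLBACK_SLOT, build_glyph_table() and glyph_slots() are invented
for this example. Decoding is left to Python's own UTF-8 codec.

```python
FALLBACK_SLOT = ord("?")  # hypothetical choice: show '?' for anything unmapped

def build_glyph_table():
    """Map code points to slots in a hypothetical ISO 8859-1-ordered font."""
    table = {cp: cp for cp in range(0x20, 0x7F)}         # printable ASCII
    table.update({cp: cp for cp in range(0xA0, 0x100)})  # ISO 8859-1 sticks A-F
    table[0xA0] = 0x20  # non-breaking space shares the space glyph
    table[0xAD] = 0x2D  # soft hyphen shares the hyphen-minus glyph
    return table

def glyph_slots(data, table):
    """Decode UTF-8 code units to code points, then look up glyph slots.

    Malformed input becomes U+FFFD via errors="replace", which is not in
    the table and therefore falls back like any other unmapped code point.
    """
    text = data.decode("utf-8", errors="replace")
    return [table.get(ord(ch), FALLBACK_SLOT) for ch in text]

table = build_glyph_table()
# "é" is two code units (0xC3 0xA9) but one code point (U+00E9), one glyph.
print(glyph_slots("é".encode("utf-8"), table))
# Cyrillic "Ж" (U+0416) has no slot in this table, so it falls back.
print(glyph_slots("aЖ".encode("utf-8"), table))
```

The lookup is keyed on code points, so an existing ISO 8859-1 font could
serve as the glyph source no matter how many code units each character
took on the wire. Adding Cyrillic or anything else would mean adding
table entries that point at whatever slots are still free.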