This does relate to a question I've been thinking about for a while, so even if actually offering diffs for that is still way above my pay grade, I will offer these thoughts:
* Of ASCII's 128 characters, only 95 are actually printable (ASCII sticks 2 thru 7 minus 0x7F DEL).[0]
* In principle, the console is capable of supporting 256 glyphs.
* With traditional Extended ASCII (EASCII) character sets, more than 95 characters were (still are) printable, but code assuming the use of ISO 8859-1 is deprecated and no longer portable in this age of UTF-8, and for EASCII sticks 8 thru F, there no longer is a direct correspondence between code points and code units at all.
* Even if framebuffer console drivers could hypothetically be altered to allow the use of more than 256 glyphs, I completely agree with Ingo that that would be a fairly terrible idea for various reasons. While the 256-glyphs limitation does stem from VGA console drivers permitting no more than 256 text mode glyphs (or 512 with hacks), it would be best to not totally break framebuffer and vga console compatibility, but to stay within those limits.
* With "extremely minimalistic UTF-8 support", up to 161 "spots" might be available.
* There are 1,112,064 legal Unicode character code points (0x11 * 0x10000 - 0x800, i.e. seventeen 65,536-character planes minus the 2,048 code points from U+D800 thru U+DFFF that are reserved for UTF-16 surrogates). Of those, 137,468 are private use, and 66 are non-characters. If we also subtract the 95 printable ASCII characters, that leaves 974,435 characters that might compete for those 161 spots.
* There is an extremely strong argument for accommodating all characters from ISO 8859-1 in any future minimalistic UTF-8 console support. The non-breaking space and soft hyphen could use the same glyphs as space and hyphen-minus, respectively. This means that to maintain maximum backwards compatibility and UTF-8 forward-portability, 94 of those 161 spots would have to be taken, leaving 67.
* There might also be a strong argument for accommodating all the characters from ISO 8859-15 (so an additional 8) and Windows-1252,[1] which despite no Unix pedigree is a common superset of ISO 8859-1, with EASCII sticks A thru F being identical to ISO 8859-1. ISO 8859-15 differs from ISO 8859-1 in that it includes 8 characters in sticks A/B that Windows-1252 encodes in sticks 8/9. However, with UTF-8, code units and code points no longer match outside of sticks 0-7, so UTF-8 implementers of ISO 8859-1 and Windows-1252 backwards compatibility get ISO 8859-15 support for free. Besides those 8, Windows-1252 support would consume an additional 19 characters, so we'd have to subtract 27 from those 67 remaining spots, leaving 40.
* 32 of those spots are from the C0 control codes from ASCII sticks 0/1. While Bemer et al. did originally propose alternatively printable glyphs for those normally unprintable characters, their glyphs were never commonly used. If "maximum printability" is a criterion, Unicode does define so-called "Control Pictures" for them (U+2400 thru U+241F).[2] It conceivably could be useful to have e.g. a console-based hex editor render something printable for most code units. However, attempts at Control Picture inclusion would bump against the technical limitation that the Control Pictures glyphs are already barely legible in X11/xterm. So could actually useful Control Pictures glyphs even be defined if one has just 8x16, 8x14, 8x10 or 8x8 pixels to play with, as may be the case on the console? It seems doubtful. Perhaps those sparse spots and precious pixels are better spent on something else, like Cyrillic for example.
* The once-common DOS code page 437 has 31 alternatively printable glyphs for sticks 0/1. Of these, only the bullet point, section sign and paragraph mark (pilcrow) can be found in the ISO 8859/Windows-1252 family. There is no compelling reason --like what's mentioned in footnote [1]-- that could motivate the inclusion of its stick 0/1 glyphs, DOS having largely gone the way of the dodo. Also, full CP437 support would require many more glyphs, and support for just this subset of that old code page, never common in Unix-land, would seem wasteful. That still leaves 40 spots that could potentially be used.
* The question is which of the 974,435 candidates deserve one of those 40 spots. With a look at a relevant map,[3] Arabic, Cyrillic, and Indic abugidas might have particularly strong claims. Arabic has 28 letters, but many contextual variants (though no case). Cyrillic, or more specifically the Russian alphabet, has 33 letters and it does have case, so 40 spots might limit any support to UPPER CASE ONLY, or should I say ЦРРЕЯ СА5Е ОИГУ. I do not feel I know enough about Indic abugidas to say something intelligent.
* The question of what subset of Unicode to settle on for minimalistic 256-glyphs-only UTF-8 support might be bigger than OpenBSD. Other Unix-like OSes might ask themselves the same question. Is this something that ought to be standardised across Unix-land or something OpenBSD would want to decide on its own?
* I mentioned "512 with hacks" above, but I do not know enough to say whether blowing past that 256 boundary could be viable, clean and VGA-compatible. If yes, then an additional 256 spots might comfortably allow for the inclusion of many more of the above.
* Either way, even if no code is created at this time, just having a roadmap and knowing which glyphs ought to make the cut might be desirable. It would also be possible to already make the font(s) once that is known. Code that actually uses such a font to implement minimalistic UTF-8 support (for the console) need not arrive at the same time.
* On the other hand, if extending our minimum character set to cover Windows-1252 and ISO 8859-15 and especially deciding upon the use of the last 40 spots cannot be settled yet, then it might be fine to leave that for later. The existing ISO 8859-1 fonts could actually be useable by a minimalistic UTF-8 support implementation, if developed. Again, once such an implementation has code points properly divorced from code units, it could absolutely source its glyphs from those fonts. That would only leave the small issue of UTF-8 compliance by everything else in base and ports...

I hope that was useful and worth the verbiage.

Thanks for your time,

Ian

(Ian Ropers)

Footnotes:

[0] Yes, they're properly called sticks. 8 sticks of 16 characters in ASCII; 16 sticks of 16 characters in EASCII. See Bob Bemer's Inside ASCII.

[1] Per enwp.org/CP1252, Windows-1252 text mislabelled as ISO-8859-1 is still very common (online), and "[m]ost modern web browsers and e-mail clients treat the media type charset ISO-8859-1 as Windows-1252 to accommodate such mislabeling. This is now standard behavior in the HTML5 specification, which requires that documents advertised as ISO-8859-1 actually be parsed with the Windows-1252 encoding."

[2] https://en.wikipedia.org/wiki/Control_Pictures

[3] https://en.wikipedia.org/wiki/File:Writing_systems_worldwide.png

On 05/10/2021, Ingo Schwarze <schwa...@usta.de> wrote:
> Hi Slava,
>
> Slava Voronzoff wrote on Tue, Oct 05, 2021 at 03:01:26PM +0300:
>
>> I'm working right now on adding cyrillic to Spleen font. How can I later
>> add it to OpenBSD kernel and ports? Pull request to main font on github
>> (Hi, Frederic) or patch here?
>
> You cannot add it to the kernel because the kernel does not support
> UTF-8, but only US-ASCII, and US-ASCII contains no code points for
> cyrillic letters.
>
> Full UTF-8 support is definitely not wanted in the kernel. Adding
> extremely minimalistic UTF-8 support to the kernel is not completely
> out of the question, but some developers are likely to feel sceptic even
> about that. Consequently, trying to pursue a project of adding anything
> related to UTF-8 to the kernel is likely to end in frustration if the
> person trying that does not have a significant amount of experience with
> getting OpenBSD kernel patches committed.
>
> I'm sorry that i know absolutely nothing about fonts in ports, maybe
> someone else can answer that part of the question.
>
> Yours,
>   Ingo
>
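
P.S. For anyone who wants to double-check the glyph-budget arithmetic in my list above, here is a minimal Python sketch. It is not OpenBSD code, and every name in it is made up for illustration. It assumes that the latin_1, iso8859_15 and cp1252 codecs bundled with Python match the respective standards, and it counts a code page position as printable whenever the codec assigns it a character. The private-use breakdown (6,400 in the BMP plus 65,534 in each of planes 15 and 16) is my own decomposition of the 137,468 figure quoted above.

```python
# Sanity check of the glyph budget discussed in this message.
# All names are hypothetical; nothing here is taken from OpenBSD.

def assigned_chars(byte_values, encoding):
    """Return the set of characters `encoding` assigns to the given single bytes."""
    chars = set()
    for b in byte_values:
        try:
            chars.add(bytes([b]).decode(encoding))
        except UnicodeDecodeError:
            pass  # position is left undefined by this code page
    return chars

TOTAL_GLYPHS = 256

# ASCII sticks 2 thru 7 (0x20-0x7F), minus DEL (0x7F).
printable_ascii = len(range(0x20, 0x80)) - 1

# ISO 8859-1 sticks A thru F; NBSP and soft hyphen reuse the space and
# hyphen-minus glyphs, so they need no spots of their own.
latin1_high = assigned_chars(range(0xA0, 0x100), "latin_1")
latin1_new_glyphs = len(latin1_high) - 2

# Characters that ISO 8859-15 has in sticks A thru F but ISO 8859-1 lacks.
latin9_only = assigned_chars(range(0xA0, 0x100), "iso8859_15") - latin1_high

# Characters that Windows-1252 places in sticks 8/9, and the part of
# them not already covered by ISO 8859-15.
cp1252_sticks_89 = assigned_chars(range(0x80, 0xA0), "cp1252")
cp1252_only = cp1252_sticks_89 - latin9_only

spots_after_ascii = TOTAL_GLYPHS - printable_ascii
spots_after_latin1 = spots_after_ascii - latin1_new_glyphs
spots_after_cp1252 = spots_after_latin1 - len(cp1252_sticks_89 | latin9_only)

# Unicode side: seventeen planes minus the UTF-16 surrogates, then minus
# private use, non-characters and the printable ASCII already placed.
legal_code_points = 0x11 * 0x10000 - 0x800
private_use = 6400 + 2 * 65534
noncharacters = 66
candidates = legal_code_points - private_use - noncharacters - printable_ascii

print(spots_after_ascii, spots_after_latin1, spots_after_cp1252)  # 161 67 40
print(len(latin9_only), len(cp1252_only))                         # 8 19
print(candidates)                                                 # 974435
```

The set arithmetic is only there to show that the 8 ISO 8859-15 additions are a subset of what Windows-1252 encodes in sticks 8/9, which is why supporting both costs 27 spots rather than 35.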