On Thu, 11 Jul 2024 at 06:09, Crystal Kolipe <kolip...@exoticsilicon.com> wrote:
> On Thu, Jul 11, 2024 at 04:25:33AM +0100, ropers wrote:
> > It's long been a secret wishlist item for me to solicit/reach agreement on
> > which 256 (possibly 512) code points might merit inclusion in a minimal
>
> There is already preliminary support for proper UTF-8 handling in the
> framebuffer console on OpenBSD.  It's still buggy, but work is on-going.

Thank you very much. That's great news.

It would be really nice if agreement could be reached between all the
BSDs (and possibly other Unix-likes) on which characters to include in
a minimalist 256- (or 512-) character subset of Unicode. (If that too
is already underway, I may simply be unaware of it and not up to speed.)

Since the traditional charset here is that of ISO-8859-1, and since
Windows-1252 is both exceedingly good^H^H^H^H common and a superset of
ISO-8859-1, Windows-1252 looks like a good starting point. (For
avoidance of doubt: By superset I mean Windows-1252 includes all the
printable characters present in ISO-8859-1, at the same code positions;
the two differ only in 0x80-0x9F, where ISO-8859-1 has C1 control codes
and Windows-1252 has additional printable characters.)

Again for avoidance of doubt: This would NOT mean OpenBSD's framebuffer
console switching to CP1252 or even adopting CP1252 -- no, OpenBSD
would still be adopting UTF-8 and UTF-8 only. However, on the question
of which of the hundreds of thousands of Unicode characters might get
one of the 256 limited-edition tickets to "supported on console"
prominence, it may not be the worst of ideas to settle on the ones that
are in CP1252. So really, we'd be talking about continuing to support
what was in ISO-8859-1, and making sensible industry-standard choices
in terms of what other characters to admit to the console club.

It is my understanding that going for 512-character framebuffer console
charsets would require forgoing broader compatibility and the possible
use of 16 colours (512-character VGA framebuffer consoles can only do
8 colours). Thus limiting the subset to 256 characters seems advisable.
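
The relationship between the two charsets can be checked mechanically.
The following is a minimal sketch, assuming Python's bundled "latin-1"
and "cp1252" codecs are faithful stand-ins for ISO-8859-1 and
Windows-1252; the function names are made up for illustration.

```python
def decode_or_none(byte, codec):
    """Decode a single byte with the given codec; None if it is undefined."""
    try:
        return bytes([byte]).decode(codec)
    except UnicodeDecodeError:
        return None


def printable_latin1():
    """Byte values of ISO-8859-1's graphic characters (0x20-0x7E, 0xA0-0xFF)."""
    return [b for b in range(0x100) if 0x20 <= b <= 0x7E or b >= 0xA0]


def moved_or_missing():
    """Latin-1 graphic characters that CP1252 lacks or encodes elsewhere."""
    return [b for b in printable_latin1()
            if decode_or_none(b, "cp1252") != decode_or_none(b, "latin-1")]


def cp1252_extras():
    """Characters CP1252 defines in 0x80-0x9F, where Latin-1 has C1 controls."""
    extras = {}
    for b in range(0x80, 0xA0):
        ch = decode_or_none(b, "cp1252")
        if ch is not None:
            extras[b] = ch
    return extras
```

An empty result from moved_or_missing() would mean every ISO-8859-1
graphic character sits at the same byte value in CP1252, so the
candidate set is ISO-8859-1 plus whatever cp1252_extras() returns.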

Even Windows-1252 still leaves a bunch of its 256 spots "unoccupied",
i.e. with no grapheme proffered. Notably Windows-1252 and ISO-8859-1 do
not define graphemes for the C0 control code characters. New and proper
framebuffer console UTF-8 handling routines could use those spots.

It would be possible to just put the C0 Control Pictures
(enwp.org/Control_Pictures) there, which might make the plaintext
column in (suitably patched) hex editors slightly more informative
(fewer dots, more identifiable characters). But those Control Pictures
might have issues in some contexts, i.e. with some consoles/terminal
emulators, since at least some X11 fonts render them too wide. So if
all the Swiss cheese holes align (enwp.org/Swiss_cheese_model), one
Control Picture character can end up two monospaced characters wide --
and while framebuffer console fonts are very much controllable, one
wouldn't want a situation where something that looks right in the
framebuffer console suddenly looks iffy in xterm (or some other X11
terminal emulator) or vice versa. At the peril of sounding like Agent
Smith, terminal text mustn't be allowed to escape the matrix. OTOH,
maybe bending over backwards to fix some other font designer's mistake
is no use or not worth the squeeze.

One character I strongly feel should be included in a common minimalist
Unicode subset is the U+FFFD � REPLACEMENT CHARACTER, even though some
font and tty combos also render that too wide.

The above would leave but a literal handful of spots. Windows maps
CP1252's 0x81, 8D, 8F, 90, and 9D spots to C1 control codes, but those
too are glyphless, and worse, there are no control pictures for C1
codes, so in terms of glyphs and graphemes these spots remain truly
empty.
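
To make the slot arithmetic concrete, here is a rough sketch of such a
256-entry table. Assumptions, none of which come from any existing
console code: Python's "cp1252" codec stands in for the Windows-1252
mapping, DEL (0x7F) also gets its Control Picture (U+2421), and U+FFFD
is used only as a fallback glyph rather than being assigned a slot. The
names console_slot, free_slots and plaintext_column are hypothetical.

```python
C0_PICTURE_BASE = 0x2400      # U+2400 SYMBOL FOR NULL; C0 pictures follow in order
DEL_PICTURE = "\u2421"        # SYMBOL FOR DELETE
REPLACEMENT = "\ufffd"        # REPLACEMENT CHARACTER


def console_slot(byte):
    """Return the character proposed for one of the 256 slots, or None."""
    if byte < 0x20:
        # C0 control codes: use the matching Control Picture.
        return chr(C0_PICTURE_BASE + byte)
    if byte == 0x7F:
        return DEL_PICTURE
    try:
        return bytes([byte]).decode("cp1252")
    except UnicodeDecodeError:
        return None           # CP1252 leaves this position undefined


def free_slots():
    """Slots that still have no glyph after the assignments above."""
    return [b for b in range(0x100) if console_slot(b) is None]


def plaintext_column(data):
    """Render bytes the way a patched hex editor's text column might."""
    return "".join(console_slot(b) or REPLACEMENT for b in data)
```

Under these assumptions free_slots() yields exactly the five positions
named above, which is where U+FFFD and any other extras would have to
go.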

I confess, for a while I thought of harebrained schemes, such as using
what remains to implement Impulse Tracker-like continuously
character-redefining smooth mouse pointer support
(enwp.org/Text-only#Under_DOS_and_Microsoft_Windows) -- or using
two-character pairs for little Puffy or Beastie logos -- however, even
given Unicode's Private Use Areas, including characters that are not in
Unicode is probably not justifiable. Not to mention I wouldn't know how
to actually code that.

I should probably stop here, lest more abstruse meanderings make me
sound any more sectionable than the above already might.

> Regarding the OP's specific question - if the files being edited only
> contain those specific UTF-8 sequences and are otherwise plain ASCII text,
> then a simple work-around might be a script that replaces each two-byte
> sequence with the corresponding ISO-8859-1 character, writes that to a
> temporary file, invokes vi for editing the temporary file, then converts
> it back to UTF-8 afterwards.

That is a pretty neat idea. For some value of "simple", I suppose. :-)
Of course, this workaround might break in new and interesting ways once
what's in the files is no longer strictly limited to two-byte
characters also present in the ISO-8859-1 charset.

Ian
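
For illustration only, the quoted workaround might look roughly like
the sketch below. It is not anyone's actual script: the function names
are invented, "vi" is assumed to be on the PATH, and the file is
assumed to fit in memory. The conversion refuses (raises an error)
as soon as the file contains a character outside ISO-8859-1, which is
exactly the breakage mentioned above, and it does so before anything is
written.

```python
import os
import subprocess
import tempfile


def utf8_to_latin1(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as ISO-8859-1; raises if a character won't fit."""
    return data.decode("utf-8").encode("latin-1")


def latin1_to_utf8(data: bytes) -> bytes:
    """Re-encode ISO-8859-1 bytes as UTF-8 (always possible)."""
    return data.decode("latin-1").encode("utf-8")


def edit_as_latin1(path, editor="vi"):
    """Edit a UTF-8 file through a temporary ISO-8859-1 copy."""
    with open(path, "rb") as f:
        latin1 = utf8_to_latin1(f.read())   # fails here, file untouched
    fd, tmp = tempfile.mkstemp(suffix=".latin1")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(latin1)
        subprocess.run([editor, tmp], check=True)
        with open(tmp, "rb") as f:
            edited = f.read()
        with open(path, "wb") as f:
            f.write(latin1_to_utf8(edited))
    finally:
        os.unlink(tmp)                      # always remove the temporary copy
```

Calling edit_as_latin1("notes.txt") (a hypothetical file name) would
open the converted copy in vi and write the result back as UTF-8 once
the editor exits successfully.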