On Thu, 11 Jul 2024 at 06:09, Crystal Kolipe <kolip...@exoticsilicon.com>
wrote:

> On Thu, Jul 11, 2024 at 04:25:33AM +0100, ropers wrote:
> > It's long been a secret wishlist item for me to solicit/reach agreement on
> > which 256 (possibly 512) code points might merit inclusion in a minimal
>
> There is already preliminary support for proper UTF-8 handling in the
> framebuffer console on OpenBSD.  It's still buggy, but work is on-going.
>

Thank you very much. That's great news.

It would be really nice if agreement could be reached between all the BSDs
(and possibly other Unix-likes) on which characters to include in a
minimalist 256--or 512--character subset of Unicode. (If that too is
already underway, I may simply be unaware of it and not up to speed.)

Since the traditional charset here is that of ISO-8859-1, and since
Windows-1252 is both exceedingly good^H^H^H^H common and a superset of
ISO-8859-1, Windows-1252 looks like a good starting point. (For avoidance of
doubt: By superset I mean that Windows-1252 includes all the characters (or
graphemes) present in ISO-8859-1, though not necessarily in the same order.)
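To make the superset claim checkable, here is a minimal sketch that compares Python's built-in `cp1252` and `latin-1` codecs. It assumes those codecs faithfully reflect the two charsets; the names `same_position`, `shared` and `extras` are made up for illustration.

```python
# Sketch: compare Windows-1252 with ISO-8859-1 using Python's built-in codecs.

def same_position(byte_value):
    """True if both charsets decode this byte to the same character."""
    b = bytes([byte_value])
    return b.decode("cp1252") == b.decode("latin-1")

# The printable ISO-8859-1 ranges: 0x20-0x7E and 0xA0-0xFF.
shared = list(range(0x20, 0x7F)) + list(range(0xA0, 0x100))
all_shared_match = all(same_position(b) for b in shared)

# What Windows-1252 adds in 0x80-0x9F, where ISO-8859-1 has no glyphs.
extras = {}
for b in range(0x80, 0xA0):
    try:
        extras[b] = bytes([b]).decode("cp1252")
    except UnicodeDecodeError:
        pass  # position left unassigned by this codec

print(all_shared_match)
print(len(extras), "extra characters, e.g.", extras[0x80])
```

According to these codecs, the printable ISO-8859-1 characters sit at the same byte values in Windows-1252, and the additions all live in the 0x80-0x9F range.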

Again for avoidance of doubt:
This would NOT mean OpenBSD's framebuffer console switching to CP1252 or
even adopting CP1252 -- no, OpenBSD would still be adopting UTF-8 and UTF-8
only, however on the question of which of the hundreds of thousands of
Unicode characters might get one of the 256 limited-edition tickets to
"supported on console" prominence, it may not be the worst of ideas to
settle on the ones that are in CP1252. So really, we'd be talking about
continuing to support what was in ISO-8859-1, and making sensible
industry-standard choices in terms of what other characters to admit to the
console club.

It is my understanding that going for 512-character framebuffer console
charsets would require forgoing broader compatibility and the possible use
of 16 colours (512-character VGA framebuffer consoles can only do 8
colours). Thus limiting the subset to 256 characters seems advisable.
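The colour limit follows from how VGA text mode stores a cell, as far as I understand it: one character byte plus one attribute byte, and in 512-character mode bit 3 of the attribute (normally the foreground intensity bit) selects the second bank of 256 glyphs instead. A rough sketch of that decoding, with the hypothetical helper `split_cell` and the blink/bright-background bit 7 ignored:

```python
# Sketch of VGA text-mode cell decoding (an assumption-laden illustration,
# not taken from any console driver).

def split_cell(char_byte, attr_byte, mode_512):
    """Return (glyph_index, foreground, background) for one text cell."""
    background = (attr_byte >> 4) & 0x07      # bits 4-6
    if mode_512:
        foreground = attr_byte & 0x07         # only bits 0-2 remain: 8 colours
        bank = (attr_byte & 0x08) << 5        # bit 3 becomes glyph index bit 8
        glyph = char_byte | bank              # 0..511
    else:
        foreground = attr_byte & 0x0F         # bits 0-3: 16 colours
        glyph = char_byte                     # 0..255
    return glyph, foreground, background

print(split_cell(0x41, 0x0F, mode_512=False))
print(split_cell(0x41, 0x0F, mode_512=True))
```

The same attribute byte that means "bright white" with 256 glyphs means "plain white, second glyph bank" with 512, which is where the 16th colour goes.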

Even Windows-1252 still leaves a bunch of its 256 spots "unoccupied", i.e.
with no grapheme proffered. Notably Windows-1252 and ISO-8859-1 do not
define graphemes for the C0 control code characters. New and proper
framebuffer console UTF-8 handling routines could use those spots.
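To put a number on the "unoccupied" spots, the following sketch sorts all 256 Windows-1252 positions into control codes, unassigned positions, and positions with a character. It relies on Python's `cp1252` codec and `unicodedata`; `classify_slots` is a hypothetical name, and note that NBSP and the soft hyphen are counted as having a glyph here.

```python
import unicodedata

def classify_slots(codec="cp1252"):
    """Group byte values 0-255 by what the codec maps them to."""
    slots = {"control": [], "unassigned": [], "glyph": []}
    for b in range(256):
        try:
            ch = bytes([b]).decode(codec)
        except UnicodeDecodeError:
            slots["unassigned"].append(b)     # no mapping at all
            continue
        # Unicode general category "Cc" covers the C0/C1 controls and DEL.
        kind = "control" if unicodedata.category(ch) == "Cc" else "glyph"
        slots[kind].append(b)
    return slots

slots = classify_slots()
for kind, values in slots.items():
    print(kind, len(values))
print([hex(b) for b in slots["unassigned"]])
```

By this count, the glyphless positions are the C0 range, DEL, and the handful of unassigned bytes in 0x80-0x9F.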

It would be possible to just put the C0 Control Pictures (
enwp.org/Control_Pictures) there, which might make the plaintext column in
(suitably patched) hex editors slightly more informative (fewer dots, more
identifiable characters). However, those Control Pictures might have issues
in some contexts, e.g. with some consoles/terminal emulators, since at least
some X11 fonts render them too wide. If all the Swiss cheese holes align
(enwp.org/Swiss_cheese_model), one Control Picture character can end up two
monospaced characters wide. And while framebuffer console fonts are very
much controllable, one wouldn't want a situation where something that looks
right in the framebuffer console suddenly looks iffy in xterm (or some
other X11 terminal emulator) or vice versa. At the peril of sounding like
Agent Smith, terminal text mustn't be allowed to escape the matrix. OTOH,
maybe bending over backwards to fix some other font designer's mistake is
no use or not worth the squeeze.
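For the hex editor case, the mapping itself is simple: the Control Pictures block places the picture for C0 code N at U+2400 + N, with DEL at U+2421. A minimal sketch of a plaintext column built that way; `control_picture` and `plaintext_column` are hypothetical helpers, not part of any existing hex editor:

```python
def control_picture(byte_value):
    """Return the Control Picture for a C0 code or DEL, else None."""
    if byte_value < 0x20:
        return chr(0x2400 + byte_value)   # U+2400 SYMBOL FOR NULL onwards
    if byte_value == 0x7F:
        return "\u2421"                   # SYMBOL FOR DELETE
    return None

def plaintext_column(data):
    """Render bytes the way a hex editor's text column might."""
    out = []
    for b in data:
        picture = control_picture(b)
        if picture is not None:
            out.append(picture)
        elif 0x20 <= b < 0x7F:
            out.append(chr(b))            # printable ASCII as-is
        else:
            out.append(".")               # everything else stays a dot
    return "".join(out)

print(plaintext_column(b"Hi\r\n\x00\x7f\xff"))
```

Whether each picture then occupies exactly one cell is up to the font and terminal, which is the width problem described above.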

One character I strongly feel should be included in a common minimalist
Unicode subset is the U+FFFD � REPLACEMENT CHARACTER, even though some font
and tty combos also render that too wide.
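As an illustration of why U+FFFD earns its slot, here is a sketch of a console that only has glyphs for the Windows-1252 characters plus U+FFFD, and substitutes U+FFFD for everything else, including malformed UTF-8. `CONSOLE_REPERTOIRE` and `render_for_console` are invented names, and passing newline and tab through untouched is an arbitrary simplification.

```python
# Characters with a glyph: everything cp1252 defines from 0x20 up, minus DEL.
CONSOLE_REPERTOIRE = set(
    bytes(range(0x20, 0x100)).decode("cp1252", errors="ignore")
)
CONSOLE_REPERTOIRE.discard("\x7f")
CONSOLE_REPERTOIRE.add("\ufffd")          # REPLACEMENT CHARACTER

def render_for_console(raw):
    """Decode UTF-8 bytes and replace anything the console cannot show."""
    # Malformed input already turns into U+FFFD at the decoding stage.
    text = raw.decode("utf-8", errors="replace")
    return "".join(
        ch if ch in CONSOLE_REPERTOIRE or ch in "\n\t" else "\ufffd"
        for ch in text
    )

print(render_for_console("naïve €5 → ok".encode("utf-8")))
print(render_for_console(b"caf\xe9"))     # Latin-1 bytes, not valid UTF-8
```

The arrow is valid Unicode but outside the subset, while the stray 0xE9 byte is not valid UTF-8 at all; both end up as the same visible marker rather than as silence or garbage.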

The above would leave but a literal handful of spots. Windows maps CP1252's
0x81, 8D, 8F, 90, and 9D spots to C1 control codes, but those too are
glyphless, and worse, there are no control pictures for C1 codes, so in
terms of glyphs and graphemes these spots remain truly empty.

I confess, for a while I thought of harebrained schemes, such as using what
remains to implement Impulse Tracker-like continuously character-redefining
smooth mouse pointer support (
enwp.org/Text-only#Under_DOS_and_Microsoft_Windows) -- or using
two-character pairs for little Puffy or Beastie logos -- however, even
given Unicode's Private Use Areas, including characters that are not in
Unicode is probably not justifiable. Not to mention I wouldn't know how to
actually code that. I should probably stop here, lest more abstruse
meanderings make me sound any more sectionable than the above already might.

> Regarding the OP's specific question - if the files being edited only
> contain those specific UTF-8 sequences and are otherwise plain ASCII text,
> then a simple work-around might be a script that replaces each two-byte
> sequence with the corresponding ISO-8859-1 character, writes that to a
> temporary file, invokes vi for editing the temporary file, then converts
> it back to UTF-8 afterwards.
>

That is a pretty neat idea. For some value of "simple", I suppose. :-)
Of course, this workaround might break in new and interesting ways once
what's in the files is no longer strictly limited to two-byte characters
also present in the ISO-8859-1 charset.
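A rough sketch of such a wrapper, for the sake of argument. It assumes `vi` is on the PATH and that the terminal it runs in displays ISO-8859-1; the function names are hypothetical. The conversion deliberately refuses to proceed if the file contains anything outside ISO-8859-1, which is exactly the failure mode mentioned above.

```python
import os
import subprocess
import sys
import tempfile

def utf8_to_latin1(data):
    """UTF-8 bytes -> ISO-8859-1 bytes; raises UnicodeEncodeError otherwise."""
    return data.decode("utf-8").encode("latin-1")

def latin1_to_utf8(data):
    """ISO-8859-1 bytes -> UTF-8 bytes; never fails, every byte is valid."""
    return data.decode("latin-1").encode("utf-8")

def edit_via_latin1(path, editor="vi"):
    with open(path, "rb") as f:
        converted = utf8_to_latin1(f.read())   # fails before touching anything
    fd, tmp = tempfile.mkstemp(suffix=".latin1")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(converted)
        subprocess.run([editor, tmp], check=True)
        with open(tmp, "rb") as f:
            edited = f.read()
        with open(path, "wb") as f:
            f.write(latin1_to_utf8(edited))
    finally:
        os.unlink(tmp)                          # always clean up the temp file

if __name__ == "__main__" and len(sys.argv) == 2:
    edit_via_latin1(sys.argv[1])
```

A file containing, say, a euro sign would make `utf8_to_latin1` raise `UnicodeEncodeError` and leave the original untouched, which is at least a loud failure rather than a silent one.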

Ian
