On Mon, 8 Jul 2024 at 19:05, <rfab...@mhsmail.ch> wrote:

> OpenBSD 7.5: In my vi, German umlauts (diaeresis) are displayed as
> follows:
> Ä: \xc3\x84
> ä: \xc3\xa4
> Ö: \xc3\x96
> ö: \xc3\xb6
> Ü: \xc3\x9c
> ü: \xc3\xbc
>
> These strings appear to consist of 2 character groups, as pressing `x`
> 2 times deletes the complete string.
>
> In man vi(1), I couldn't find anything concerning the file encoding,
>

Just because of the way you put that, and at the peril of dumbsplaining
Unicode to someone who--generally speaking--quite possibly knows much more
than yours truly:
What vi(1) displays there are (the hex equivalents of) UTF-8 code units.
Whenever old vi(1) Can't Even, it will barf hex, but treat each hex-barf
byte as a separate character, even when--as here--the two bytes are but one
character. Dunning-Kruger applies: Old vi(1) is ignorant of UTF-8
multi-byte characters and unaware of it.
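
For anyone who wants to reproduce the six strings from the report outside of vi(1), here is a minimal Python sketch. The helper name hex_escapes is made up for illustration; it only prints the UTF-8 bytes of each umlaut in the same backslash-x notation, and shows that each one is two bytes for a single character.

```python
# The six umlauts from the original report.
umlauts = "\u00c4\u00e4\u00d6\u00f6\u00dc\u00fc"   # Ä ä Ö ö Ü ü

def hex_escapes(ch):
    """Return the UTF-8 bytes of ch in backslash-x hex notation."""
    return "".join("\\x%02x" % b for b in ch.encode("utf-8"))

for ch in umlauts:
    raw = ch.encode("utf-8")
    # One character (code point), but two bytes (code units) in UTF-8.
    print(ch, hex_escapes(ch), len(ch), "character,", len(raw), "bytes")
```

This is why pressing `x` twice is needed: a byte-oriented editor sees two units where there is one character.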

In case you are not familiar with the difference between code points and
code units, and how to convert from the former to the latter (and vice
versa), Graham Douglas has you covered. This page is an excellent resource
and might help you wrap your head around that:
http://www.readytext.co.uk/?p=1284

Code points and code units used to be identical in ye olde extended ASCII
code pages (like ISO-8859-1, Windows-1252, CP437, etc.), so there weren't
even any of these special terms for them back in the day. They were all
just the chars in some 256-character charset. However, you're prolly not in
Kansas, and they're not identical in most Unicode formats anymore.
(Reaching agreement on whether they are to be considered identical in
UTF-32, and whether that might mean UTF-32 technically isn't even really a
*transformation* format is left as an exercise for interested--if not
pugnacious--readers.)
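
To make the code point vs. code unit distinction a bit more tangible, here is a rough Python sketch. The helper code_units is a hypothetical name of mine; it simply chops the encoded bytes into units of the given width, using the big-endian codec variants so no byte order mark gets in the way.

```python
ch = "\u00c4"   # LATIN CAPITAL LETTER A WITH DIAERESIS
cp = ord(ch)    # the code point as an integer, 0xC4

def code_units(text, encoding, width):
    """Encode text and split the bytes into code units of `width` bytes."""
    raw = text.encode(encoding)
    return [int.from_bytes(raw[i:i + width], "big")
            for i in range(0, len(raw), width)]

latin1 = code_units(ch, "latin-1", 1)    # one unit, same value as the code point
utf8 = code_units(ch, "utf-8", 1)        # two 8-bit units
utf16 = code_units(ch, "utf-16-be", 2)   # one 16-bit unit
utf32 = code_units(ch, "utf-32-be", 4)   # one 32-bit unit, equal to the code point

for name, units in [("latin-1", latin1), ("utf-8", utf8),
                    ("utf-16", utf16), ("utf-32", utf32)]:
    print("%-8s" % name, [hex(u) for u in units])
```

Only in the old 8-bit code page and in UTF-32 does the single unit carry the same number as the code point. Outside the Basic Multilingual Plane, UTF-16 needs two units as well.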
In the variable-width and frequently multi-byte 8-bit Unicode
transformation format UTF-8, code points (U+0000 thru U+10FFFF) can be
transformed into anything from one (barely *transformed*, as before)
through four 8-bit byte code units. This is done for efficient and ASCII
backwards-compatible storage (encoding).
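
The code point to code unit direction can be sketched in a few lines of Python. This is my own illustrative helper (utf8_encode is not a library function), written from the bit layout of UTF-8 and not taken from any particular implementation:

```python
def utf8_encode(cp):
    """Encode one code point (an int) into a list of 8-bit UTF-8 code units."""
    if not 0 <= cp <= 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not an encodable code point: %#x" % cp)
    if cp < 0x80:        # 1 byte: 0xxxxxxx, identical to ASCII
        return [cp]
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return [0xC0 | (cp >> 6),
                0x80 | (cp & 0x3F)]
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return [0xE0 | (cp >> 12),
                0x80 | ((cp >> 6) & 0x3F),
                0x80 | (cp & 0x3F)]
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return [0xF0 | (cp >> 18),
            0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F)]

print(["%02X" % b for b in utf8_encode(0x00C4)])
```

Anything below U+0080 comes out as a single, untouched byte, which is where the ASCII backwards compatibility comes from.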
Your uppercase A umlaut (LATIN CAPITAL LETTER A WITH DIAERESIS) in code
unit form can thus be turned back into its code point form:
C3         84
1100 0011  1000 0100  <-- strip leading (110) and continuation byte (10) prefixes
   0 0011    00 0100  <-- reformat what remains
      000  1100 0100  <-- left-fill with zeroes for complete bytes
0000 0000  1100 0100
       00  C4
de-rigueur U+ notation: U+00C4
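
The same steps as a small Python sketch, limited to the two-byte case used here. The function name decode_two_byte is mine and purely illustrative; a real decoder would also handle the one-, three- and four-byte forms.

```python
def decode_two_byte(lead, cont):
    """Turn a two-byte UTF-8 sequence back into its code point."""
    if (lead & 0xE0) != 0xC0:
        raise ValueError("lead byte does not start with 110")
    if (cont & 0xC0) != 0x80:
        raise ValueError("continuation byte does not start with 10")
    # Strip the prefixes: keep 5 payload bits of the lead byte and
    # 6 payload bits of the continuation byte, then join them.
    cp = ((lead & 0x1F) << 6) | (cont & 0x3F)
    if cp < 0x80:
        raise ValueError("overlong encoding")
    return cp

cp = decode_two_byte(0xC3, 0x84)
print("U+%04X" % cp)   # prints U+00C4
```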

> The umlauts are displayed correctly in xterm.

It's long been a secret wishlist item for me to solicit/reach agreement on
which 256 (possibly 512) code points might merit inclusion in a minimal
Unicode subset which could then be used to make even ye olde text consoles
and console fonts on all the BSDs as Unicode-compatible as they can be.

In short: a 256-character subset of Unicode > ISO-8859-1 on console

Hardware (VGA/BIOS) limitations mean only 256 (possibly 512) characters
could make the grade. However--due to personal reasons--I've long been out
of it, or rather--confession time--I never got *into* it with OpenBSD as
much as I always wanted and would have liked, so due to my never having
gotten to the point of "patch productivity", I didn't really even dare ask
others for something like that. If there's something like an open mike at
EuroBSDcon, and IFF they let me in without signing over my firstborn,
perhaps that might be a good place to raise such an issue anyway? Or maybe
even to elaborate and dumbsplain UTF-8 and the quirks and history of ASCII,
and why RFC 4648 is Considered Harmful, though I rather suspect most of the
audience would be way ahead of most of what I might have to say in
technical terms, so maybe not.

Ian
