On Mon, 8 Jul 2024 at 19:05, <rfab...@mhsmail.ch> wrote:
> OpenBSD 7.5: In my vi, German umlauts (diaeresis) are displayed as
> follows:
> Ä: \xc3\x84
> ä: \xc3\xa4
> Ö: \xc3\x96
> ö: \xc3\xb6
> Ü: \xc3\x9c
> ü: \xc3\xbc
>
> These strings appear to consist of 2 character groups, as pressing `x`
> 2 times deletes the complete string.
>
> In man vi(1), I couldn't find anything concerning the file encoding,
>
Just because of the way you put that, and at the peril of dumbsplaining
Unicode to someone who--generally speaking--quite possibly knows much
more than yours truly:

What vi(1) displays there are (the hex equivalents of) UTF-8 code units.
Whenever old vi(1) Can't Even, it will barf hex, but treat each hex-barf
byte as a separate character, even when--as here--the two bytes are but
one character. Dunning-Kruger applies: Old vi(1) is ignorant of UTF-8
multi-byte characters and unaware of it.

In case you are not familiar with the difference between code points and
code units, and how to convert from the former to the latter (and vice
versa), Graham Douglas has you covered. This page is an excellent
resource and might help you wrap your head around that:

http://www.readytext.co.uk/?p=1284

Code points and code units used to be identical in ye olde extended
ASCII code pages (like ISO-8859-1, Windows-1252, CP437, etc.), so there
weren't even any of these special terms for them back in the day. They
were all just the chars in some 256-character charset.

However, you're prolly not in Kansas, and they're not identical in most
Unicode formats anymore. (Reaching agreement on whether they are to be
considered identical in UTF-32, and whether that might mean UTF-32
technically isn't even really a *transformation* format is left as an
exercise for interested--if not pugilant--readers.)

In the variable-width and frequently multi-byte 8-bit Unicode
transformation format UTF-8, code points (U+0000 thru U+10FFFF) can be
transformed into anything from one (barely *transformed*, as before)
through four 8-bit byte code units. This is done for efficient and ASCII
backwards-compatible storage (encoding).
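For anyone who prefers to poke at this rather than read about it, here
is a minimal Python sketch of the code-point-to-code-unit direction. It
is only an illustration: the function name utf8_encode is made up for
this mail, it has nothing to do with how vi(1) works internally, and in
real code you would simply call str.encode("utf-8"). The loop runs the
six umlauts from the original question through it and cross-checks the
result against Python's own codec.

```python
def utf8_encode(cp: int) -> bytes:
    """Turn one Unicode code point into its UTF-8 code units (bytes).

    Hypothetical helper for illustration only.
    """
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:        # 1 byte:  0xxxxxxx (plain ASCII, untouched)
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


# Code points of the six umlauts: A, a, O, o, U, u with diaeresis.
for cp in (0xC4, 0xE4, 0xD6, 0xF6, 0xDC, 0xFC):
    units = utf8_encode(cp)
    # Cross-check the hand-rolled result against Python's built-in codec.
    assert units == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} -> " + "".join(f"\\x{b:02x}" for b in units))
```

The first line printed is U+00C4 -> \xc3\x84, i.e. the same two
hex-barf bytes vi(1) showed for the uppercase A umlaut.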
Your uppercase A umlaut (LATIN CAPITAL LETTER A WITH DIAERESIS) in code
unit form can thus be turned back into its code point form:

    C3 84
    1100 0011 1000 0100   <-- strip leading (110) and continuation
                              byte prefixes (10)
       0 0011   00 0100   <-- reformat what remains
          000 1100 0100   <-- left-fill with zeroes for complete bytes
    0000 0000 1100 0100
           00        C4
    de-rigueur U+ notation: U+00C4

> The umlauts are displayed correctly in xterm.
>

It's long been a secret wishlist item for me to solicit/reach agreement
on which 256 (possibly 512) code points might merit inclusion in a
minimal Unicode subset which could then be used to make even ye olde
text consoles and console fonts on all the BSDs as Unicode-compatible as
they can be.

    a 256-character subset of UTF-8 > ISO-8859-1 on console

Hardware (VGA/BIOS) limitations mean only 256 (possibly 512) characters
could make the grade.

However--due to personal reasons--I've long been out of it, or
rather--confession time--I never got *into* it with OpenBSD as much as I
always wanted and would have liked, so due to my never having gotten to
the point of "patch productivity", I didn't really even dare ask others
for something like that.

If there's something like an open mike at EuroBSDcon, and IFF they let
me in without signing over my firstborn, perhaps that might be a good
place to raise such an issue anyway? Or maybe even to elaborate and
dumbsplain UTF-8 and the quirks and history of ASCII, and why RFC 4648
is Considered Harmful, though I rather suspect most of the audience
would be way ahead of most of what I might have to say in technical
terms, so maybe not.

Ian