On 10/03/2024 15:16, Nick wrote:
> I'm attempting to learn about UTF-8. My question is about how wc
> counts "combining characters", as discussed here
> <https://www.cl.cam.ac.uk/~mgk25/unicode.html#comb>.
> I made two files, one with "LATIN CAPITAL LETTER A WITH DIAERESIS"
> called p1.txt. The other with "LATIN CAPITAL LETTER A" followed by
> "COMBINING DIAERESIS", called p2.txt. Neither file contained a
> newline or any other bytes.
> $ od --format=x1 p1.txt
> 0000000 c3 84
> 0000002
> $ od --format=x1 p2.txt
> 0000000 41 cc 88
> 0000003
> My question is: why does wc say that p2.txt contains two characters?
> $ wc -m -c p?.txt
> 1 2 p1.txt
> 2 3 p2.txt
> 3 5 total
> I'd naively expected that second line of output to start with 1,
> i.e. saying the file p2.txt has one character. Markus Kuhn's FAQ says
> "A combining character is not a full character by itself" but wc is
> saying that it is, no?
> Sorry if this has already been done to death. My search of the archives
> failed to find a previous discussion but perhaps I missed them.
> Thanks
It's a fair point. LibreOffice, for example, will count that sequence as one
character. It will also count a combining character as one character when it
doesn't follow another character.
But then again, wc(1) is a lower-level tool, and will for example count '\n'
as a character, while LibreOffice will not.
Note that wc(1) at least doesn't assign the combining character a width of its own:
$ printf "\x41\xcc\x88" | wc -L
1
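As a rough sketch of where the 2 comes from (an illustration of the counting,
not wc's actual implementation): in valid UTF-8 every code point has exactly
one non-continuation byte, so deleting the continuation bytes (0x80 to 0xBF)
and counting what is left gives the number of code points. The variable names
below are only for illustration, and the snippet assumes a POSIX shell with
printf, tr and wc.

```shell
# The bytes of p2.txt from the question (41 cc 88), written as octal escapes.
p2='A\314\210'

# Total number of bytes, which is what wc -c reports.
bytes=$(( $(printf "$p2" | wc -c) ))

# UTF-8 continuation bytes lie in 0x80..0xBF (octal 200..277).
# LC_ALL=C makes tr work byte by byte; deleting the continuation bytes
# leaves exactly one byte per code point.
codepoints=$(( $(printf "$p2" | LC_ALL=C tr -d '\200-\277' | wc -c) ))

echo "bytes=$bytes codepoints=$codepoints"
```

This prints bytes=3 codepoints=2, matching the wc -m -c output for p2.txt:
the base letter and the combining diaeresis are separate code points, even
though they are displayed as a single glyph.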
cheers,
Pádraig