At 2023-04-26T19:33:48+0200, Oliver Corff wrote:
> On 26/04/2023 15:16, G. Branden Robinson wrote:
> > Be sure you review my earlier messages to Oliver in detail. The
> > hyphenation code isn't "broken", it's simply limited to the C/C++
> > char type for character code points and hyphenation codes (which are
> > not "the same thing as" character code points, but do correspond to
> > them).
>
> I am not familiar with modern incarnations of C/C++. Is there really
> no char data type that is Unicode-compliant?
There is. But "Unicode" is a _family_ of standards. There are multiple ways to encode Unicode characters, and those ways are good for different things. Unicode is a 20.1-bit character encoding.[0] For practical purposes, this rounds to 32 bits. So if you use a 32-bit arithmetic type to store a Unicode character, you'll be fine. The type `char32_t` has been around since ISO C11 and ISO C++11 and is arguably the best fit for this purpose, since `int` is not _guaranteed_ to be 32 bits wide.[1] A long, long time ago, people noticed that in real-world texts, the code points used by Unicode strings were not, and were not ever expected to be, anywhere near uniformly distributed within the code space. That fact on top of the baked-in use of only 20.1 bits of a 32-bit type can make use of the latter more wasteful than a lot of people can tolerate. In fact, for much of the text encountered on the Internet--outside of East Asia--the Unicode code points encountered in character strings are extremely heavily weighted toward the left side of the distribution-- specifically, to the first 128 code points, also known as ISO 646 or "ASCII". Along came Unix creator, Ken Thompson, over 20 years after his first major contribution to software engineering. Thompson was a man whose brain took to Huffman coding like a duck to water. Thus was born UTF-8, (which isn't a Huffman code precisely but has properties reminiscent of one) where your ASCII code points would be expressible in one byte, and then the higher the code point you needed to encode, the longer the multi-byte sequence you required. Since the Unicode Consortium had allocated commonly used symbols and alphabetic scripts toward to lower code points in the first place, this meant that even where you needed more than one byte to encode a code point, with UTF-8 you might not need more than two. And as a matter of fact, under UTF-8, every character in every code block up to and including NKo is expressible using up to two bytes.[2] So UTF-8 is pretty great at not being wasteful, but it does have downsides. It is more expensive to process than traditional byte-wide strings. It has state. When you see a byte with its high bit set, you know that it begins or continues a UTF-8 sequence. Not all such sequences are valid. You have to decide what to do if a multibyte UTF-8 sequence is truncated. If you _write_ UTF-8, you have to know how to do so; the mapping from an ISO 10646-1 20.1-bit code point to a UTF-8 sequence is not trivial. groff, like AT&T nroff (and roff(1) before it?), doesn't handle a large quantity of character strings at one time, relative to the overall size of its typical inputs. It follows the Unix filter model and does not absorb an entire input before processing it. It accumulates input lines until it is time to emit an output line (either to the output stream or to a diversion), then flushes the output line and moves on. It would probably be a good idea to represent Unicode strings internally using char32_t as a base type anyway, but groff's design under the Unix filter model described above makes the choice less dramatic in terms of increased space consumption than it would otherwise be. The formatter is not even instrumented to measure the output node lists it builds. If this were done, we could tell exactly what the cost of moving from `char` to `char32_t` is. And if I'm the person who ends up doing this work, maybe I will collect such measurements. "Unicode-compliant" is not a precise enough term to mean much of anything. 
Unicode is a family of standards: the core, of which most people have heard, and a pile of "standard annexes", which can be really important in various domains of practical engineering around this encoding system.[3]

There exist many resources to assist us with converting UTF-8 to and from 32-bit code points. For instance, there is GNU libunistring (see the short sketch after the footnotes below). I've studied the problems it identifies with "the char32_t approach"[4] and I don't think they apply to groff. Maybe gnulib, which we already use, has some facilities as well. I'm not sure--it's huge and I've only recently started familiarizing myself with its manual.

I've been pointedly ignoring two other encodings of Unicode strings: UTF-16LE and UTF-16BE. They are terrible, but we can't completely avoid them; Microsoft and Adobe are deeply wedded to them, and while we can largely ignore Microsoft, Adobe managed to contaminate the international standard for PDF with UTF-16 (in its LE form, I assume[5]). This is a practical concern for us, because sometimes PDF bookmarks need non-ASCII characters in them. We therefore need a way to express such characters in device control requests and escape sequences (e.g., `\X`), and our PDF output driver (gropdf(1)) needs to handle them appropriately.

Regards,
Branden

[0] If you're like me, the idea of a "20.1-bit" quantity sounds weird. You can't encode a tenth of a bit in a single logic gate, or one position in a machine register. The key is to think in terms of information theory, not digital logic. Unicode has decided that its range of valid code points is zero to 0x10FFFF. That's 1114111 decimal. That number, plus one for code point 0, is the number of distinct characters encodable in Unicode. The base 2 logarithm of that is...

    $ python3 -c "import math; print(math.log(1114112, 2))"
    20.087462841250343

[1] I think that guarantee holds, as a minimum, under POSIX. But an `int` could be 64 bits wide. Using an `int` to store a Unicode character is a waste of space on such systems.

[2] https://en.wikipedia.org/wiki/Unicode_block

[3] https://unicode.org/versions/Unicode15.0.0/

[4] https://www.gnu.org/software/libunistring/manual/libunistring.html#The-char32_005ft-problem

[5] I should _remember_, because we had a problem with Poppler's pdfinfo(1) tool output that is fixed for groff 1.23.0. But I don't recollect. Poppler shouldn't have handed us that noise in the first place, but its development community seems to be hostile to much of the software ecosystem around it. "some font thing failed", etc.
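P.S. The libunistring conversion mentioned above can look roughly like the following. This is a sketch of mine, not groff code; it assumes <unistr.h>'s u8_to_u32(), which allocates the UTF-32 result with malloc() when handed a null result buffer:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistr.h>   /* GNU libunistring */

    int
    main(void)
    {
        /* "cafe" with an e-acute: five UTF-8 bytes, four code points. */
        static const uint8_t utf8[] = { 'c', 'a', 'f', 0xC3, 0xA9 };
        size_t n = 0;
        /* A null result buffer asks u8_to_u32() to allocate one for us. */
        uint32_t *u32 = u8_to_u32(utf8, sizeof utf8, NULL, &n);
        if (u32 == NULL)
            return EXIT_FAILURE;   /* invalid UTF-8, or out of memory */
        for (size_t i = 0; i < n; i++)
            printf("U+%04X\n", (unsigned) u32[i]);   /* U+0063 U+0061 U+0066 U+00E9 */
        free(u32);
        return EXIT_SUCCESS;
    }

Something like "cc demo.c -lunistring" should build it on a system with libunistring installed.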