At 2023-04-26T19:33:48+0200, Oliver Corff wrote:
> On 26/04/2023 15:16, G. Branden Robinson wrote:
> > Be sure you review my earlier messages to Oliver in detail. The
> > hyphenation code isn't "broken", it's simply limited to the C/C++
> > char type for character code points and hyphenation codes (which are
> > not "the same thing as" character code points, but do correspond to
> > them).
>
> I am not familiar with modern incarnations of C/C++. Is there really
> no char data type that is Unicode-compliant?
There is. But "Unicode" is a _family_ of standards. There are multiple ways to encode Unicode characters, and those ways are good for different things. Unicode is a 20.1-bit character encoding.[0] For practical purposes, this rounds to 32 bits. So if you use a 32-bit arithmetic type to store a Unicode character, you'll be fine. The type `char32_t` has been around since ISO C11 and ISO C++11 and is arguably the best fit for this purpose, since `int` is not _guaranteed_ to be 32 bits wide.[1] A long, long time ago, people noticed that in real-world texts, the code points used by Unicode strings were not, and were not ever expected to be, anywhere near uniformly distributed within the code space. That fact on top of the baked-in use of only 20.1 bits of a 32-bit type can make use of the latter more wasteful than a lot of people can tolerate. In fact, for much of the text encountered on the Internet--outside of East Asia--the Unicode code points encountered in character strings are extremely heavily weighted toward the left side of the distribution-- specifically, to the first 128 code points, also known as ISO 646 or "ASCII". Along came Unix creator, Ken Thompson, over 20 years after his first major contribution to software engineering. Thompson was a man whose brain took to Huffman coding like a duck to water. Thus was born UTF-8, (which isn't a Huffman code precisely but has properties reminiscent of one) where your ASCII code points would be expressible in one byte, and then the higher the code point you needed to encode, the longer the multi-byte sequence you required. Since the Unicode Consortium had allocated commonly used symbols and alphabetic scripts toward to lower code points in the first place, this meant that even where you needed more than one byte to encode a code point, with UTF-8 you might not need more than two. And as a matter of fact, under UTF-8, every character in every code block up to and including NKo is expressible using up to two bytes.[2] So UTF-8 is pretty great at not being wasteful, but it does have downsides. It is more expensive to process than traditional byte-wide strings. It has state. When you see a byte with its high bit set, you know that it begins or continues a UTF-8 sequence. Not all such sequences are valid. You have to decide what to do if a multibyte UTF-8 sequence is truncated. If you _write_ UTF-8, you have to know how to do so; the mapping from an ISO 10646-1 20.1-bit code point to a UTF-8 sequence is not trivial. groff, like AT&T nroff (and roff(1) before it?), doesn't handle a large quantity of character strings at one time, relative to the overall size of its typical inputs. It follows the Unix filter model and does not absorb an entire input before processing it. It accumulates input lines until it is time to emit an output line (either to the output stream or to a diversion), then flushes the output line and moves on. It would probably be a good idea to represent Unicode strings internally using char32_t as a base type anyway, but groff's design under the Unix filter model described above makes the choice less dramatic in terms of increased space consumption than it would otherwise be. The formatter is not even instrumented to measure the output node lists it builds. If this were done, we could tell exactly what the cost of moving from `char` to `char32_t` is. And if I'm the person who ends up doing this work, maybe I will collect such measurements. "Unicode-compliant" is not a precise enough term to mean much of anything. 
Unicode is a family of standards: the core, of which most people have heard, and a pile of "standard annexes", which can be really important in various domains of practical engineering around this encoding system.[3]

There exist many resources to assist us with converting UTF-8 to and from 32-bit code points. For instance, there is GNU libunistring (see the short sketch after the footnotes below). I've studied the problems it identifies with "the char32_t approach"[4] and I don't think they apply to groff. Maybe gnulib, which we already use, has some facilities as well. I'm not sure--it's huge and I've only recently started familiarizing myself with its manual.

I've been pointedly ignoring two other encodings of Unicode strings: UTF-16LE and UTF-16BE. They are terrible, but we can't completely avoid them; Microsoft and Adobe are deeply wedded to them, and while we can largely ignore Microsoft, Adobe managed to contaminate the international standard for PDF with UTF-16 (in its LE form, I assume[5]). This is a practical concern for us, because sometimes PDF bookmarks need non-ASCII characters in them. We therefore need a way to express such characters in device control requests and escape sequences (e.g., `\X`), and our PDF output driver (gropdf(1)) needs to handle them appropriately.

Regards,
Branden

[0] If you're like me, the idea of a "20.1-bit" quantity sounds weird. You can't encode a tenth of a bit in a single logic gate, or one position in a machine register. The key is to think in terms of information theory, not digital logic. Unicode has decided that its range of valid code points is zero to 0x10FFFF. That's 1114111 decimal. That number, plus one for code point 0, is the number of distinct characters encodable in Unicode. The base 2 logarithm of that is...

    $ python3 -c "import math; print(math.log(1114112, 2))"
    20.087462841250343

[1] I think that guarantee holds, as a minimum, under POSIX. But an `int` could be 64 bits wide. Using an `int` to store a Unicode character is a waste of space on such systems.

[2] https://en.wikipedia.org/wiki/Unicode_block

[3] https://unicode.org/versions/Unicode15.0.0/

[4] https://www.gnu.org/software/libunistring/manual/libunistring.html#The-char32_005ft-problem

[5] I should _remember_, because we had a problem with Poppler's pdfinfo(1) tool output that is fixed for groff 1.23.0. But I don't recollect. Poppler shouldn't have handed us that noise in the first place, but its development community seems to be hostile to much of the software ecosystem around it. "some font thing failed", etc.
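P.S. The libunistring conversion mentioned above can look roughly like the following. This is a sketch of mine, not groff code; it assumes <unistr.h>'s u8_to_u32(), which allocates the UTF-32 result with malloc() when handed a null result buffer:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistr.h>   /* GNU libunistring */

    int
    main(void)
    {
        /* "cafe" with an e-acute: five UTF-8 bytes, four code points. */
        static const uint8_t utf8[] = { 'c', 'a', 'f', 0xC3, 0xA9 };
        size_t n = 0;
        /* A null result buffer asks u8_to_u32() to allocate one for us. */
        uint32_t *u32 = u8_to_u32(utf8, sizeof utf8, NULL, &n);
        if (u32 == NULL)
            return EXIT_FAILURE;   /* invalid UTF-8, or out of memory */
        for (size_t i = 0; i < n; i++)
            printf("U+%04X\n", (unsigned) u32[i]);   /* U+0063 U+0061 U+0066 U+00E9 */
        free(u32);
        return EXIT_SUCCESS;
    }

Something like "cc demo.c -lunistring" should build it on a system with libunistring installed.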