>> However, please avoid the term 'AGL compatible'.  We are not
>> talking about glyphs but about characters!
>
> Maybe this confusion is not only my fault but also the manual's:
>
>   The distinction between input, characters, and output, glyphs,
>   is not clearly separated in the terminology of groff; for
>   example, the char request should be called glyph since it
>   defines an output entity.
Yes, this is a mess, introduced long before I was involved in groff.
However, this affects the terminology but not the documentation
itself.  AFAIK, I've fixed all places in the doc files to make a
clear distinction between input characters and output glyphs.

> As I now understand, there's no internal representation of
> characters in groff.  There are only input characters and output
> entities.

Actually, it's the same object, containing a reference to both the
input character and the corresponding output glyph.

> The former ones are found on the input stream, sometimes together
> with escape sequences specifying output entities directly -- like
> \[uXXXX] with Russian UTF-8 input after processing by preconv.  The
> latter ones are stored in groff's intermediate output and read in
> by postprocessors.

This is correct, more or less.

> If the postprocessor is not targeting a character-cell device, then
> these output entities are also called glyphs.  They are not to be
> confused with, say, the glyphs of a PostScript font, about which
> groff itself knows nothing; it is the grops postprocessor that,
> using its font-definition files, converts groff's glyphs into
> PostScript glyphs.

Correct.

> The Groff Glyph List (GGL) is just a fixed set of glyph identifiers
> without a predefined mapping either from input characters, which is
> defined by character translation requests like .trin in groff
> source files, or to the symbols in the resulting document, because
> it is up to the postprocessor whether (and how) to interpret them.

Correct.

> It seems to me that the GGL was created to provide default support
> for 8-bit encodings that would work out of the box,

No.  The GGL has been modeled after the AGL (using similar rules to
construct some glyph names algorithmically) and the LICR, the LaTeX
Internal Character Representation, which is also a collection of
internal entities.
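As an aside, the effect of preconv mentioned above can be sketched in
a few lines.  This is an illustration, not preconv's actual code; the
function name and the exact escape formatting are assumptions, though
groff's \[uXXXX] form with at least four uppercase hex digits is real:

```python
def to_groff_escapes(text: str) -> str:
    """Replace each non-ASCII input character with a groff
    \\[uXXXX] escape, roughly as preconv does for UTF-8 input.
    (Illustrative sketch; real preconv handles more cases.)"""
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)          # plain ASCII passes through
        else:
            out.append("\\[u%04X]" % ord(ch))
    return "".join(out)

# Russian input: 'Д' is U+0414, 'а' is U+0430.
print(to_groff_escapes("Да"))       # -> \[u0414]\[u0430]
```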
> and to have meaningful identifiers for the symbols of the non-ASCII
> part of the Latin-1 encoding, thereby standardizing the names of
> these 8-bit symbols across all postprocessors.

troff has always had a lot of glyphs which are neither ASCII nor
Latin-1!  It's better to avoid the term `8-bit' here since it sounds
as if there were a limit of 256 entities.

> It probably came into existence when the hard-coded dependency on
> Latin-1 was removed, because now the font files had to substitute
> something for the glyph names \[char128]-\[char255] which they had
> relied upon.

Yes, and to harmonize all entity names across all devices.  In
particular, devdvi had some conflicting names.

> Am I correct in suggesting that the Adobe Glyph List algorithm is
> used in afmtodit?

Kind of.  It's a simplified one, and the resulting glyph names are
tailored for groff.

>> Contrary to TeX, groff handles hyphenation before the conversion
>> from characters to glyphs has happened (more or less).
>
> More or less, because the input file may already contain escapes
> for addressing output entities directly, in which case groff has to
> convert them to 'phantom' input characters which were never in the
> input yet must be used for hyphenation.

Yes.  Another reason is that the character and glyph representation
share the same structure; this means that there is not a strict

                hyphenation
    character -----------------> glyph

model.

>>> But generally, this map cannot be inversely applied because
>>> several input characters may be mapped onto one internal
>>> entity.  What does groff do in this case?
>>
>> Please give me an example where this is relevant to hyphenation.
>
> An error in the mapping file, like this:
>
>   .trin a\[u0430]
>   .trin b\[u0430]
>
> makes it impossible for groff to calculate the hyphenation code for
> \[u0430], yet otherwise such a setup using UTF-8 input remains
> fully functional.

This is not the example I was expecting.
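To illustrate the "simplified AGL-style" naming mentioned above: the
AGL lets a name like 'uni0430' encode a code point algorithmically,
and groff's corresponding algorithmic names look like 'u0430'.  The
following is a rough sketch under those assumptions, not afmtodit's
actual code; the function names are invented for the example:

```python
def groff_glyph_name(codepoint: int) -> str:
    """Return a groff-style algorithmic glyph name such as 'u0430'
    (at least four uppercase hex digits after 'u')."""
    return "u%04X" % codepoint

def parse_agl_uni_name(name: str):
    """Parse an AGL 'uniXXXX' name into its code point; return None
    for non-algorithmic names (illustrative, simplified)."""
    if name.startswith("uni") and len(name) == 7:
        try:
            return int(name[3:], 16)
        except ValueError:
            return None
    return None

# 'uni0430' (Cyrillic small letter a) maps to groff's 'u0430'.
cp = parse_agl_uni_name("uni0430")
print(groff_glyph_name(cp))          # -> u0430
```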
In other words, you don't have one, which is good :-)


    Werner