-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Russ Allbery wrote:
[...]
> Okay, your analysis matches what I thought was going on. However, David
> Given seems to be seeing something else where some man pages are already
> encoded in UTF-8. So I guess I'm confused as to what's going on and what
> the current status is.

I've only got a handful of them. Here's one:

vim-common: /usr/share/man/it.UTF-8/man1/rvim.1.gz

That's vim-common 1:7.0-122+1etch2.

Here's the relevant comment from the source of man-db:

/* Due to historical limitations in groff (which may be removed in the
 * future), there is no mechanism for a man page to specify its own
 * encoding. This means that each national language directory needs to carry
 * with it information about its encoding, and each groff device needs to
 * have a default encoding associated with it. Out of the box, groff
 * formally allows only ISO-8859-1 on input; however, patches originating
 * with Debian and imported by many other GNU/Linux distributions change
 * this somewhat.
 *
 * Eventually, groff will support proper Unicode input, and much of this
 * horror can go away.
 *
 * Do *not* confuse source encoding with groff encoding. The encoding
 * specified in this table is the encoding in which the source man pages in
 * each language directory are expected to be written. The groff encoding is
 * determined by the selected groff device and sometimes also by the user's
 * locale.
 *
 * The standard output encoding is the encoding assumed for cat pages for
 * each language directory. It must *not* be used to discover the actual
 * output encoding displayed to the user; that is determined by the locale.
 * TODO: it would be useful to be able to change the standard output
 * encoding in the configuration file.
 *
 * This table is expected to change over time, particularly as man pages
 * begin to move towards UTF-8. Feel free to patch this for your
 * distribution; send me updates for languages I've missed.
 *
 * Explicit encodings in the directory name (e.g. de_DE.UTF-8) override this
 * table.
 */

(man-db-2.4.3/src/encodings.c)

> If our groff really can handle UTF-8 input and is doing so for some
> locales, I'd love to declare all regular man pages are in UTF-8 and be
> done with it; that's a change that we can probably make without backward
> compatibility issues right now, since currently those code points are
> disallowed.

Weeeell... unfortunately man-db uses ISO-8859-1 for C and POSIX locales, so
transcoding would be required. Further investigation reveals that man-db
seems to transcode UTF-8 to ISO-8859-1 before passing it to groff.

man-db has three tables. This one tells it what encoding to use for each
locale:

    { "C",     "ISO-8859-1", "ANSI_X3.4-1968" }, /* English */
    { "POSIX", "ISO-8859-1", "ANSI_X3.4-1968" }, /* English */
#ifdef MULTIBYTE_GROFF
    /* These languages require a patched version of groff with the
     * ascii8 and nippon devices.
     */
    { "ja", "EUC-JP", "EUC-JP" }, /* Japanese */
    { "ko", "EUC-KR", "EUC-KR" }, /* Korean */
    ...

The two columns seem to be: encoding man page is written in, encoding to
use when saving in cat page.

This one tells it what output device to use:

    { "ANSI_X3.4-1968", "ascii"  },
    { "ISO-8859-1",     "latin1" },
    { "ISO-8859-15",    "latin1" },
    { "UTF-8",          "utf8"   },
#ifdef MULTIBYTE_GROFF
    { "EUC-JP",         "nippon" },
#endif /* MULTIBYTE_GROFF */

And this one tells it what encoding to pass in to each groff device:

    { "ascii",  "ISO-8859-1", "ANSI_X3.4-1968" },
    { "latin1", "ISO-8859-1", "ISO-8859-1" },
    { "utf8",   "ISO-8859-1", "UTF-8" },
#ifdef MULTIBYTE_GROFF
    { "ascii8", NULL, NULL },
    { "nippon", "EUC-JP", "EUC-JP" },

(Columns are: encoding to pass into groff, encoding passed out of groff.)

Note that if utf8 is selected as the output device, which appears to happen
if the source encoding is UTF-8, the groff source encoding is specified as
ISO-8859-1 and a transcode happens.

It's all a bit of a maze, unfortunately, and I could have misunderstood
things. But that MULTIBYTE_GROFF #define looks interesting.
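
To make the chain of lookups concrete, here is a rough sketch of how I read
the three tables fitting together. It is not man-db's code: the struct and
function names are made up for illustration, only the table rows are copied
from the excerpts above, and choosing the device from the source encoding is
my interpretation, which may be wrong.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical types and names; only the rows below come from the
 * man-db excerpts quoted in this message. */
struct dir_entry   { const char *lang, *source_enc, *cat_enc; };
struct charset_dev { const char *charset, *device; };
struct dev_entry   { const char *device, *roff_enc, *output_enc; };

static const struct dir_entry dir_table[] = {
    { "C",     "ISO-8859-1", "ANSI_X3.4-1968" },
    { "POSIX", "ISO-8859-1", "ANSI_X3.4-1968" },
};

static const struct charset_dev charset_table[] = {
    { "ANSI_X3.4-1968", "ascii"  },
    { "ISO-8859-1",     "latin1" },
    { "ISO-8859-15",    "latin1" },
    { "UTF-8",          "utf8"   },
};

static const struct dev_entry dev_table[] = {
    { "ascii",  "ISO-8859-1", "ANSI_X3.4-1968" },
    { "latin1", "ISO-8859-1", "ISO-8859-1" },
    { "utf8",   "ISO-8859-1", "UTF-8" },
};

#define COUNT(a) (sizeof(a) / sizeof((a)[0]))

/* Source encoding for a language directory such as "C" or "it.UTF-8".
 * An explicit ".ENCODING" suffix in the name overrides the table. */
static const char *source_encoding(const char *dir)
{
    const char *dot = strchr(dir, '.');
    size_t i;

    if (dot != NULL && dot[1] != '\0')
        return dot + 1;
    for (i = 0; i < COUNT(dir_table); i++)
        if (strcmp(dir_table[i].lang, dir) == 0)
            return dir_table[i].source_enc;
    return NULL;
}

/* groff output device associated with a character set, or NULL. */
static const char *device_for(const char *charset)
{
    size_t i;

    for (i = 0; i < COUNT(charset_table); i++)
        if (strcmp(charset_table[i].charset, charset) == 0)
            return charset_table[i].device;
    return NULL;
}

/* Encoding that has to be fed into the given groff device, or NULL. */
static const char *roff_encoding_for(const char *device)
{
    size_t i;

    for (i = 0; i < COUNT(dev_table); i++)
        if (strcmp(dev_table[i].device, device) == 0)
            return dev_table[i].roff_enc;
    return NULL;
}

/* 1 if the page must be transcoded before groff sees it, 0 if not,
 * -1 if any lookup in the chain fails. */
static int needs_transcode(const char *dir)
{
    const char *src  = source_encoding(dir);
    const char *dev  = src ? device_for(src) : NULL;
    const char *roff = dev ? roff_encoding_for(dev) : NULL;

    if (roff == NULL)
        return -1;
    return strcmp(src, roff) != 0;
}
```

With these rows, needs_transcode("it.UTF-8") comes out as 1 (UTF-8 source,
utf8 device, ISO-8859-1 fed to groff), while needs_transcode("C") comes out
as 0, which is the asymmetry described above.
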
It *might* be possible to crudely hack it to work by using the nippon
device and the EUC-JP encoding for man pages written in UTF-8. I don't know
what the coverage of EUC-JP is like compared to UTF-8, but there might be
mileage there. Alternatively, ascii8 is supposed to be eight-bit clean, and
might suffice...

- --
┌── dg@cowlark.com ─── http://www.cowlark.com ───────────────────
│
│ "There does not now, nor will there ever, exist a programming language in
│ which it is the least bit hard to write bad programs." --- Flon's Axiom
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.6 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iD8DBQFGwMl4f9E0noFvlzgRAp5TAKC3gWIPYf7lUBcguf7HySWkzZk5WwCgw4I3
WPtVKwn8MquypQdtbPkl+z8=
=F9pn
-----END PGP SIGNATURE-----