Bug#517074: manpages: state encoding of iso-8859-* pages

Colin Watson Wed, 11 Mar 2009 03:23:02 -0700

On Wed, Mar 11, 2009 at 01:40:41PM +1300, Michael Kerrisk wrote:
> (I'm the upstream manpages maintainer, and I'm going to defer totally
> to your judgement on what needs to be done here.)


:-)

> In this report I see:
> 
> [[
> > * Better solutions *
> >
> > In a second step, I tried to move the page iso_8859-* to a directory
> > whose name tells what the encoding is (I typically move the
> > iso_8859-15 page to a directory named "en.ISO8859-15/man7"). The pipeline
> > seems to become better as we now obtain:
> 
> This is one approach, but a cleaner one would be to change the first
> line of iso-8859-15.7.gz to:
> 
>   '\" t -*- coding: ISO-8859-15 -*-
> ]]
> 
> It looks like this is the only piece that applies for the man-pages
> maintainer, right?

That's correct.

There is one small downside: versions of man-db before 2.4.4 will
misparse this line because they don't know to stop at the first space
after the "t"; they spew some error messages but otherwise behave
correctly. man-db 2.4.4 was released in 2007, though, and I don't think
there will be any distributions that take the new manpages but don't
take the new man-db. (The old version of the alternative 'man' package
that I have lying around from 2001 doesn't suffer from this problem, so
I think it must be OK.)
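
To illustrate the parsing issue, here is a minimal Python sketch of
reading such a first line. It is not man-db's actual code (man-db is
written in C and handles more forms than this); the function name
parse_first_line and the regular expression are assumptions made for
illustration only. The point is that the preprocessor letters end at
the first space, and the coding tag follows.

```python
import re

# Hypothetical pattern for the simple "-*- coding: NAME -*-" form only.
CODING_RE = re.compile(r"-\*-\s*coding:\s*([^\s;]+)\s*-\*-")

PREFIX = "'\\\" "  # the four characters: quote, backslash, double quote, space


def parse_first_line(line):
    """Return (preprocessor_letters, encoding_or_None) for a page's first line."""
    if not line.startswith(PREFIX):
        return "", None
    rest = line[len(PREFIX):].strip()
    match = CODING_RE.search(rest)
    encoding = match.group(1) if match else None
    # The preprocessor letters stop at the first space; a line that starts
    # straight away with the coding tag declares no preprocessors.
    first = rest.split(None, 1)[0] if rest else ""
    preprocessors = "" if first.startswith("-*-") else first
    return preprocessors, encoding
```

A parser that treated everything after the prefix as preprocessor
letters would instead see "-", "*", "c", "o" and so on as unknown
preprocessors, which is roughly the kind of noise older versions
produce.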

> Am I correct to assume that there should be analogous changes in all
> of the other iso_8859-*.7 pages, so that each such page specifies its
> specific locale at the top?

Yes.
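
If it helps, the change could be scripted along these lines. This is
only a sketch under assumptions: the helper names encoding_for_page and
declare_encoding are hypothetical, and I'm assuming that each page is
named iso_8859-N.7 and is actually encoded in ISO-8859-N, which is
worth checking per page before applying anything.

```python
import re


def encoding_for_page(filename):
    """Map e.g. 'iso_8859-15.7' to 'ISO-8859-15'; return None for other names."""
    match = re.fullmatch(r"iso_8859-(\d+)\.7(?:\.gz)?", filename)
    return "ISO-8859-" + match.group(1) if match else None


def declare_encoding(page_text, encoding):
    """Return page_text with a coding tag on its first line, adding one if needed."""
    tag = "-*- coding: " + encoding + " -*-"
    first, sep, rest = page_text.partition("\n")
    if "-*-" in first:
        # Already tagged: leave the page alone.
        return page_text
    if first.startswith("'\\\""):
        # Existing preprocessor line: append the tag after the letters.
        return first.rstrip() + " " + tag + sep + rest
    # No preprocessor line at all: add a new first line carrying only the tag.
    return "'\\\" " + tag + "\n" + page_text
```

For example, a page whose first line is '\" t would end up starting
with '\" t -*- coding: ISO-8859-15 -*-, matching the line suggested in
the report.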

For your reference, here's a potted summary of the rules that reasonably
recent versions of man-db apply to manual page encoding. (The
alternative 'man' package is much less tolerant here; as I understand
it, it basically has to be configured to expect a single encoding for
any given directory tree and can't really cope with anything more
complicated, so if I were you I'd leave that problem to distributions
shipping it.)

  * man-db will always attempt to decode a page as UTF-8 before anything
    else, since in practice text is only going to successfully decode as
    UTF-8 if it actually is UTF-8.

  * Explicitly declared encodings are used next if available, whether
    they're explicitly declared in a preprocessor line as above, or by
    means of installing the manual page into a directory such as
    /usr/share/man/en_GB.ISO-8859-15.

  * Every manual page hierarchy has a default legacy encoding which is
    tried next, usually that in which the vast majority of historical
    pages are encoded. For English, of course, that's ISO-8859-1.

  * Unless groff 1.20 is available, man-db effectively recodes the page
    to the legacy encoding before feeding it to groff, since older
    versions of groff can't deal with UTF-8 input. (The patches I just
    applied as a result of this bug cause man-db to only do this for
    UTF-8 pages, since recoding between legacy encodings is generally a
    mug's game.) In practice, this means that even if you encode your
    pages in UTF-8 you can only use those characters available in the
    appropriate legacy encoding; anything else will at best be
    approximated, perhaps badly.
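
The order of the rules above can be sketched in Python roughly as
follows. This is an illustration of the logic only, not man-db's
implementation: the function names decode_page and recode_for_old_groff
are hypothetical, the "declared" argument stands for whatever encoding
the preprocessor line or directory name gave, and errors="replace"
(which substitutes "?") is just a crude stand-in for the approximation
that happens in practice.

```python
def decode_page(raw, declared=None, legacy="ISO-8859-1"):
    """Return (text, encoding_used), trying UTF-8, then declared, then legacy."""
    # Rule 1: UTF-8 first, since non-UTF-8 text rarely decodes as UTF-8.
    try:
        return raw.decode("utf-8"), "UTF-8"
    except UnicodeDecodeError:
        pass
    # Rules 2 and 3: the explicitly declared encoding, then the hierarchy's
    # default legacy encoding.
    for encoding in (declared, legacy):
        if encoding is None:
            continue
        try:
            return raw.decode(encoding), encoding
        except (UnicodeDecodeError, LookupError):
            continue
    raise ValueError("page does not decode in any candidate encoding")


def recode_for_old_groff(text, source_encoding, legacy="ISO-8859-1"):
    """Return the bytes to feed a groff that cannot read UTF-8 input."""
    if source_encoding != "UTF-8":
        # Pages already in a legacy encoding are passed through unchanged.
        return text.encode(source_encoding)
    # UTF-8 pages are recoded; characters missing from the legacy encoding
    # are lost (here replaced by "?").
    return text.encode(legacy, errors="replace")
```

With this sketch, the byte 0xA4 decodes to the euro sign when
ISO-8859-15 is declared, and to the generic currency sign when only the
ISO-8859-1 default applies, which is exactly why the declaration
matters for these pages.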

I haven't yet been encouraging upstream maintainers to switch to
shipping manual pages in UTF-8 because I think there are still a number
of distributions that would have trouble dealing with that (although
most of the major distributions have switched, including Debian).
Declaring an explicit encoding for anything that isn't ISO-8859-1 or
UTF-8 is a good middle ground for the moment.

-- 
Colin Watson                                       [[email protected]]


