On Mon, Sep 03, 2007 at 05:38:10PM +0200, Giacomo A. Catenazzi wrote:
> Colin Watson wrote:
> >--- orig/policy.sgml
> >+++ mod/policy.sgml
> >@@ -8450,6 +8450,39 @@
> > be present in the future.
> > </footnote>
> > </p>
> >+
> >+ <p>
> >+ Manual pages that are installed under
> >+ <file>/usr/share/man/</file><var>ll</var>, where <var>ll</var>
> >+ is an ISO-639 language code, must be encoded with the usual
> >+ legacy (non-UTF-8) character set for that language, as shown
> >+ by:
> >+ <example compact="compact">
> >+egrep -v '\.|@|UTF-8' /usr/share/i18n/SUPPORTED
> >+ </example>
> >+ <footnote>
> >+ This is necessary because many packages have historically
> >+ included manual pages encoded thus, and changing the
> >+ encoding of the whole hierarchy would involve a difficult
> >+ transitional period.
> >+ </footnote>
> >+ Manual pages that are installed under
> >+ <file>/usr/share/man/</file><var>locale</var>, where
> >+ <var>locale</var> is a full locale name listed in
> >+ <file>/usr/share/i18n/SUPPORTED</file>, must be encoded with
> >+ the character set implied by that locale.
> >+ </p>
>
> I don't like the proposal ;-)
> It is not very POSIXly and to application specific.

Of course it is application-specific; /usr/share/man is
application-specific (i.e. specific to the man application). Methods of
processing /usr/share/man that don't use /usr/bin/man are already broken
in other ways. (man exports a number of specialised interfaces that can
be used by frontends, and I'm happy to add more on request.)

POSIX does not specify anything about the layout of /usr/share/man. The
FHS makes an attempt, but it's horribly broken (speaking as one who has
attempted to implement it), predates widespread deployment of UTF-8, and
does not really help with the problem to hand anyway.

> 1-
> The POSIX way to specify locale is:
> language[_territory][.codeset] or
> [EMAIL PROTECTED] for some LC_ variables)

Note that e.g. fr.UTF-8 matches this pattern, so I don't see your
problem.

The territory is intentionally omitted from the installation directory
in my transition plan because it causes real problems. man will support
full locale names under /usr/share/man, but in my transition plan I do
not recommend using them because you don't typically want to make your
French manual pages available only to users in France; they should be
available to Belgians, French Canadians, Swiss French, and Luxembourgers
as well. The standard exceptions well-known to internationalisation
implementors are Chinese (zh_CN and zh_TW are different dialects and
different scripts) and Portuguese (pt_PT and pt_BR are more or less
different languages).

> It is confusing the "legacy (non-UTF-8) character".

Yes, it is, but it is current practice and I merely document it. If we
were starting from scratch with the benefit of hindsight then obviously
we wouldn't have done it this way. I think it's unambiguous for all
languages where we actually have existing manual pages to worry about.

> Every locale has a charset. So the man page should be
> encoded according the right locale (in the manual PATH).

My proposal (the diff, as opposed to the transition plan later in my
original message) documents current practice, in which manual pages are
installed in directories such as /usr/share/man/fr. "fr" is not a full
locale name recognised by glibc, and does not have a defined character
set in our system. Thus, we must define its character set by means of
observing that historically pages installed there have been encoded in
ISO-8859-1, and standardising that to prevent unsolvable encoding
conflicts.

In future, it absolutely makes sense to install the pages in
/usr/share/man/fr.UTF-8 instead, which is where my transition plan takes
us. But, for now, the only available alternatives are
/usr/share/man/fr_FR.ISO-8859-1 and /usr/share/man/fr_FR.UTF-8, which
(as above) have fundamental problems, and in any case are not
well-supported at the moment (in man-db 2.4.*,
/usr/share/man/fr_FR.UTF-8 will only be used if you are using that exact
locale; in man-db 2.5.0, it will be used for users of the fr_FR
(ISO-8859-1) locale as well and recoded on the fly, so that you don't
have to install one manual page per possible encoding).

> 2-
> I've some problem with
> /usr/share/i18n/SUPPORTED
> Who generate this file?
> IIRC our glibc has more locales.

glibc ships this file.

  $ dpkg -S /usr/share/i18n/SUPPORTED
  locales: /usr/share/i18n/SUPPORTED
  $ apt-cache show locales | grep Source:
  Source: glibc

> I don't find "en", "de".

That's because glibc does not recognise those as valid locales. If you
believe that a locale exists in our system but it is not in
/usr/share/i18n/SUPPORTED, you are by definition mistaken. :-)

> 3-
> With the above point, I think that "en" (as example) has
> a charset (from glibc), so man page should be set with
> such charset.

Your assumption is mistaken, I'm afraid. /usr/share/i18n/SUPPORTED is
the canonical list of available locales in our system. There is no
straightforward way to ask the question "what is the conventional legacy
character set for <language>?"
without also specifying a country, which doesn't help when trying to
determine the character set of files under /usr/share/man/fr. That's why
man has its own table for this.

> >+
> >+ <p>
> >+ At present, it is not generally possible to install a manual
> >+ page encoded in UTF-8 such that it will be used in all locales
> >+ for that language (for example, a page installed under
> >+ <file>/usr/share/man/fr_FR.UTF-8</file> will not be used in
> >+ the <tt>fr_BE.UTF-8</tt> locale). It is therefore not yet
> >+ recommended to install pages encoded in UTF-8, but rather to
> >+ continue using the legacy encoding.<footnote>This is expected
> >+ to change as of man-db 2.5.0.</footnote>
> >+ </p>
> > </sect>
> >
> > <sect>
>
> If I understand correctly, this is only a transitional comment, so
> I think we should forget about this, and update the policy when
> the man-db/man is corrected.

I'm happy to go that route too; I simply thought that, if a policy
upload was coming soon, it might be helpful to document current
practice. It also gives me something to document the new policy against
after man-db 2.5.0. :-)

> > 2. man-db 2.5.0-1 uploaded, including support for installing pages
> > in /usr/share/man/<ll>.<codeset>/ (e.g. /usr/share/man/fr.UTF-8).
> > The basename of this directory is not typically a well-formed
> > locale, but it is appropriate because it allows a clear
> > specification of the hierarchy's encoding while applying to all
> > countries using that language.
>
> Use locale and locale priorities as specified on POSIX, and allow full
> <locale> not only a subclass.

man-db permits them and will continue to do so, but as above I strongly
believe that with the exception of Chinese and Portuguese it is not
generally to our users' advantage to install manual pages under full
locale names, unless you're lucky enough to use a language spoken in
only one country. (IIRC you're in Switzerland; do you use it_CH.UTF-8?
If so, you would not be well-served by pages specifying it_IT.UTF-8, in
the same way that you would not be well-served by .po files specifying
"it_IT" rather than just "it".)

> > 3. man-db 2.5.0-1 moves into testing.
> >
> > 4. Packages encouraged (via debian-devel-announce) to begin using
> > /usr/share/man/<ll>.UTF-8/; installation in other hierarchies will
> > not be necessary as man-db will recode as needed. Packages using
> > these hierarchies will be encouraged to declare Conflicts: man-db
> > (<< 2.5.0-1) (or will Breaks: be allowed by that point? is either
> > one just overkill?).
>
> I don't think we should go to UTF-8, but we should allow users to use
> any good (for the language) charset. It is also a lot difficult to
> change charset or upstreams.

I should clarify that /usr/share/man/<ll>.UTF-8/ will be used by man for
all <ll>* locales, not merely for those where the user requested UTF-8;
man will recode to the appropriate character set on the fly.

It is true that manual pages could be installed using any character set
and would work fine, but since we will be able to standardise on UTF-8 I
think we should do so, for all the same reasons that we should
standardise on UTF-8 elsewhere: for one, it greatly simplifies things if
you're looking at manual page source for whatever reason.

Upstreams do not need to change, or at least can change at their
leisure; it's trivial to recode the page to UTF-8 in debian/rules.

> So I propose that manpage specify a charset (i.e. not using the defaul
> local with only the language (and territory)).

That is what I'm doing here. The character set named in the directory
name specifies the encoding for all manual pages installed under that
directory; it does not mandate that only users of that character set may
use these manual pages. (I understand your confusion since this is not
what is implemented in current man-db, but frankly that implementation
doesn't benefit anyone.)
There are other ways of specifying the encoding, such as by putting it
in a header in the page itself, but those are much less convenient in
practice and are less efficient when implemented (since you have to
decompress and open the page before you can find its encoding).

> > 5. Update dh_installman to recode manual pages to UTF-8
> > automatically and install them under /usr/share/man/<ll>.UTF-8/.
> > Getting the Conflicts:/Breaks: in here might be difficult, plus I'm
> > not sure I'm wild about creating several thousand more arcs in our
> > dependency graph. Maybe it's better just to wait for a stable
> > release before changing debhelper, and not worry too much about the
> > Conflicts:/Breaks: as it's not like the whole system will break as
> > a result.
>
> change: to encode on relevant charset. BTW I think it should be done
> on dynamically on "man" program.

As above, you appear to have misunderstood the transition plan; man will
recode dynamically.

> BTW there should be only one "original" man page per language, and
> this page should create the other encodings (but for very special
> cases). Otherwise it should be difficult to maintain in parallel the
> versions.

There should be only one manual page per language, full stop. In the new
world order, it should be installed under /usr/share/man/<ll>.UTF-8 and
all other encodings will be generated on the fly.

> > 7. Distant future: deprecate /usr/share/man/<ll>/. This will only
> > be for consistency, so there's no need to rush.
>
> No, but in a short future: it should be a symbolic link to the right
> (as defined in locale) ll.charset

No, this cannot be done safely (it will create incompatibility) and is
furthermore unnecessary and confusing. In any case it is not possible
for a symbolic link on the filesystem to be dependent on the user's
locale. This is handled in other ways.

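For illustration only, the one-off recoding of an existing page into the
new hierarchy could be sketched with iconv roughly as follows. The
function name, its arguments and the paths in the usage line are
hypothetical; this is not what dh_installman or man-db actually runs.

```shell
# Hypothetical helper (name and arguments are assumptions made for this
# sketch): recode one manual page from a legacy encoding to UTF-8 and
# write it to the given destination path.
recode_manpage() {
    src=$1    # existing page, e.g. one encoded in ISO-8859-1
    from=$2   # legacy encoding of that page
    dest=$3   # target path, e.g. under .../usr/share/man/fr.UTF-8/man1/

    # Create the target directory, then convert the page with iconv.
    mkdir -p "$(dirname "$dest")"
    iconv -f "$from" -t UTF-8 "$src" > "$dest"
}
```

A call from debian/rules might then look like this (package name and
paths invented for the example):

  recode_manpage foo.fr.1 ISO-8859-1 \
      debian/foo/usr/share/man/fr.UTF-8/man1/foo.1
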
> Eventually we should discuss with glibc people about locale
> definition, and how to export information to other programs (and thus
> "man")

I've implemented all this personally; glibc already provides all the
information I need, aside from the strange question of "conventional
legacy encodings" which is an extremely ambiguous and debatable request
to make of glibc in any case and which is already handled in a good
enough way in man. There is no need for glibc to change here.

Cheers,

-- 
Colin Watson [EMAIL PROTECTED]