On Tue, Sep 11, 2007 at 09:55:44AM +0100, Colin Watson wrote:
> > Woh, it's great to hear from you. I'm afraid I've been lazy too, you
> > should be shown ready patches instead of hearing "that's mostly
> > working"...
>
> If you do work on patches, please make sure they're against current
> bzr; there have been a lot of changes since 2.4.4.

Noted.

> > > I do need to find the stomach to look at upgrading groff again,
> > > but it's not *necessary* (or indeed sufficient) for this. The most
> > > important bit to start with is really the changes to man-db.
> >
> > We do need to change them both at once.
>
> No, we don't. Seriously, I understand the problem and it's not
> necessary. man-db can stick iconv pipes in wherever it likes and it's
> all fine. When we upgrade groff at some future point we can just
> declare versioned dependencies or conflicts as necessary, but it is
> *not* necessary for this transition. A basic rule of release
> management is that the more you decouple the easier it will be.

Yet if groff cannot accept any encoding other than ISO-8859-1 (with
hacks for ja/ko/zh), you end up with data loss for anything not
representable in 8859-1.

> > The meat of Red Hat changes to groff is:
> >
> >     ISO-8859-1/"nippon" -> LC_CTYPE
> >
> > and then man-db converts everything into the current locale charset.
>
> (Point of information: Red Hat doesn't use man-db.)

I didn't look that far; I didn't bother with installing a whole Red Hat
system, just did:

    ./test-groff -man -Tutf8 <foo.7

which seems to work perfectly. After extending the upper range from
uFFFF to u10FFFF it works like this:
http://angband.pl/deb/man/test.png

> Thus what you're saying seems to be that Red Hat uses the ascii8
> device, or its equivalent (ascii8 passes through any 8-bit encoding
> untouched,

Actually, their -Tascii8 is completely broken; they use -Tutf8 instead.

> although certain characters are still reserved for internal use by
> groff which is why it doesn't help with UTF-8). groff upstream has
> repeatedly rejected this as typographically wrong-headed; I don't
> want to perpetuate it. groff is supposed to know what the characters
> really are, not just treat them as binary data.

I fully agree. The multibyte patch for 1.8 (which Red Hat refers to
everywhere as "the Debian patches") lets groff store characters as
Unicode code points; the input/output issues are what we're trying to
fix in this thread, and the properties of particular characters are an
orthogonal matter.

> Obviously we have to cope with what we've got, so ascii8 is a
> necessary evil, but it is just plain wrong to use it when we don't
> have to.

So let's skip it?

> > My own tree instead hardcodes it to UTF-8 under the hood; now it
> > seems to me that it would probably be best to allow groff1.9-ish
> > "-K charset", so man-db would be able to say "-K utf-8" while other
> > users of groff would be unaffected (unlike Red Hat).
>
> None of this is immediately necessary. Leave groff alone for the
> moment and the problem is simpler. iconv pipes are good enough for
> the time being. When we do something better, it will be a proper
> upgrade of groff converging on real UTF-8 input with proper knowledge
> of typographical meanings of glyphs (as upstream are working on), not
> this badly-designed hodgepodge.

Isn't reading input into a string of Unicode code points good enough
for now? It's a whole world better than operating on opaque binary
strings (ascii8), and it works well wherever RTL or combining
character support is not needed.
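By "a string of Unicode code points" I mean roughly the following
(untested sketch, assuming glibc iconv and a little-endian host, hence
UCS-4LE; this is not what the multibyte patch actually does, just the
shape of the idea):

    /* Decode LEN bytes of BUF, tagged as ENCODING ("UTF-8",
     * "ISO-8859-1", ...), into an array of code points; from there on
     * the formatter never needs to know what a "charset" is.
     * Error handling elided.  Returns the number of code points. */
    #include <iconv.h>
    #include <stdint.h>

    size_t decode_to_codepoints(const char *encoding, char *buf,
                                size_t len, uint32_t *out, size_t outmax)
    {
        iconv_t cd = iconv_open("UCS-4LE", encoding);
        char *outp = (char *) out;
        size_t outleft = outmax * sizeof *out;

        iconv(cd, &buf, &len, &outp, &outleft);
        iconv_close(cd);
        return (uint32_t *) outp - out;
    }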
> > Yet:
> > [~/man]$ grep ^U mans.enc |wc -l
> > 843
> > [~/man]$ grep ^U mans.enc |grep '\.UTF-8'|wc -l
> > 21
> >
> > So you would leave those 822 manpages broken.
>
> If the alternative is breaking the 10522 pages listed in your
> analysis that are ISO-8859-* but not declared as such in their
> directory name, absolutely!

Yeah, breaking those 10522 pages would be outright wrong. But with a
bit of temporary ugliness in the pipeline we can have both: the 10522
ones in legacy charsets and the 822 prematurely transitioned ones all
working.

> > My pipeline is a hack, but it transparently supports every manpage
> > except the several broken ones. If we could have UTF-8 man in the
> > policy, we would also get a guarantee that no false positive
> > appears in the future.
>
> So, last night I was thinking about this, and wanted to propose a
> compromise where we recommend in Debian policy that pages be
> installed in a directory that explicitly specifies the encoding (you
> might not like this, but it makes man-db's life a lot easier, it's
> much easier to tell how complete the transition is, and it's what
> the FHS says we should do), but for compatibility with the RPM world
> we transparently accept UTF-8 manual pages installed in
> /usr/share/man/$LL/ anyway.

So you would want to have the old ones put into
/usr/share/man/ISO-8859-1/ (or man.8859_1) instead of /usr/share/man/?
That would work, too. I'm opposed to spelling /usr/share/man/UTF-8/ in
full on aesthetic grounds: the point of Unicode is to forget that
anything called a "charset" ever existed or needed to be set; but it's
you who decides here, after all.

> I do have an efficiency concern as man-db upstream, though, which is
> why I hadn't just implemented this in the obvious crude way (try
> iconv -f UTF-8, throw away the pipeline on error, try again).

Yeah, doing the whole pipeline twice would be horrendous.

> For large manual pages it's still of practical importance that the
> formatting pipeline be smooth; that is, I don't want to have to scan
> the whole page looking for non-UTF-8 characters before I can pass it
> to groff. My ideal implementation would involve a program, let's
> call it "manconv", with behaviour much like the following:
>
>  * Reads from standard input and writes to standard output.
>
>  * Valid options are -f ENCODING[:ENCODING...], -t ENCODING, and -c;
>    these are interpreted as with iconv except that -f's argument is
>    a colon-separated list of encodings to try, typically something
>    like UTF-8:ISO-8859-1. Fallback is only possible if characters
>    can be expected to be invalid in leading encodings.
>
>  * The implementation would use iconv() on reasonably-sized chunks
>    of data (let's say 4KB). If it encounters EILSEQ or EINVAL, it
>    will throw away the current output buffer, fall back to the next
>    encoding in the list, and attempt to convert the same input
>    buffer again.

EINVAL is possible only if a sequence is cut by the end of the buffer,
so it's ok.
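Concretely, I'd imagine the conversion loop to look something like
this (untested sketch, names made up, error checking mostly elided;
EILSEQ triggers the fallback, while EINVAL just carries the cut
sequence over to the next read):

    #include <errno.h>
    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    #define CHUNK 4096   /* "reasonably-sized"; but see below */

    /* FROM is a NULL-terminated list of candidate encodings,
     * e.g. { "UTF-8", page_encoding, NULL }. */
    int manconv(const char *const *from, const char *to)
    {
        int cur = 0;
        iconv_t cd = iconv_open(to, from[cur]);
        char in[CHUNK], out[4 * CHUNK];
        size_t have = 0;   /* bytes carried over from the last chunk */
        ssize_t n;

        while ((n = read(STDIN_FILENO, in + have, CHUNK - have)) > 0) {
            char *inp = in;
            size_t inleft = have + n;

            for (;;) {
                char *outp = out;
                size_t outleft = sizeof out;

                if (iconv(cd, &inp, &inleft, &outp, &outleft)
                    != (size_t) -1 || errno == EINVAL) {
                    /* Clean conversion, or a multibyte sequence cut
                     * by the end of the buffer: flush, read more. */
                    fwrite(out, 1, outp - out, stdout);
                    break;
                }
                if (errno == EILSEQ && from[cur + 1]) {
                    /* Invalid in the current encoding: throw away
                     * this chunk's output, fall back to the next
                     * candidate, reconvert the same input. */
                    iconv_close(cd);
                    cd = iconv_open(to, from[++cur]);
                    inp = in;
                    inleft = have + n;
                    continue;
                }
                return 1;   /* EILSEQ with no fallback left, or
                               E2BIG; a real tool would cope */
            }
            have = inleft;
            memmove(in, inp, have);   /* keep the cut sequence */
        }
        iconv_close(cd);
        return 0;
    }

The earlier chunks have already been flushed by the time we fall back,
but per your caveat that's fine as long as everything valid in a
trailing encoding is invalid in the leading ones; a pure-ASCII prefix
converts the same either way.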
> This would have the behaviour that output is issued smoothly, and
> for -f UTF-8:* the encoding is detected correctly provided that
> there is a non-UTF-8 character within the first 4KB of the file. I
> haven't tested this, but intuitively it seems that it should be a
> good compromise.

Bad news: 4KB is not enough. Often, 8-bit characters are used only in
a copyright sign or in the authors list. The first offending
characters sit at these uncompressed offsets:

  33219 man3/Mail::Message::Field.3pm.gz
  33226 man1/full_index.1grass.gz
  36027 man1/mined.1.gz
  37172 man3/Date::Pcalc.3pm.gz
  39127 man1/SWISH-FAQ.1.gz
  40214 man3/Event.3pm.gz
  41114 man3/Class::Std.3pm.gz
  42997 man3/SoQtViewer.3.gz
  47367 man3/Net::SSLeay.3pm.gz
  53003 man1/SWISH-CONFIG.1.gz
  57955 man7/groff_mm.7.gz
  59990 man3/HTML::Embperl.3pm.gz
  63733 man3/Date::Calc.3pm.gz
  67045 man1/pcal.1.gz (pcal)
  72423 man1/spax.1.gz (star)
 194227 man8/backuppc.8.gz (backuppc)

So we can either:
a) slurp the whole file (up to 585KB, save for wireshark-filter which
   is a 6MB monstrosity),
b) use an ugly 190KB buffer,
c) bribe the backuppc maintainer to go down to 71KB, or
d) do the same with pcal and star too, for a round number of 64KB.

> Is this what your "hack" pipeline implements? If so, I'd love to see
> it; if not, I'm happy to implement it.

The prototype is:

    pipeline_command_args (p, "perl", "-CO", "-e",
        "use Encode;"
        "undef $/;"                          /* slurp the whole page */
        "$_=<STDIN>;"
        "eval{print decode('utf-8',$_,1)};"  /* try strict UTF-8 first */
        "print decode($ARGV[0],$_) if $@",   /* else the declared encoding */
        page_encoding, NULL);

so it's similar. "Slurp everything into core" in C is a page of code;
your idea of a static buffer makes it simpler, and I'm not in a
position to complain that it's another hack :p

I thought about forking off to avoid a separate binary, but a separate
binary could potentially be reused by someone else.

For -c, glibc's //TRANSLIT or my translit[1] is always better: both
drop accents and the like, and when they fail to find a valid
replacement they at least output "?" instead of silently dropping the
character.

[1]. http://angband.pl/svn/kbtin/trunk/translit.h; unlike glibc, it
intentionally doesn't do æ->ae, which is worse for flowing text but
won't break pre-formatted or character-cell text.
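To make the difference from a bare -c concrete, a toy demo (not
anything man-db would ship; note that glibc's transliteration tables
depend on the locale):

    #include <iconv.h>
    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        setlocale(LC_ALL, "");   /* pick up the locale's translit tables */
        iconv_t cd = iconv_open("ASCII//TRANSLIT", "UTF-8");
        char in[] = "Dvořák naïve æther";   /* UTF-8 input */
        char out[64];
        char *inp = in, *outp = out;
        size_t inleft = strlen(in), outleft = sizeof out - 1;

        if (cd == (iconv_t) -1)
            return 1;
        iconv(cd, &inp, &inleft, &outp, &outleft);
        *outp = '\0';
        puts(out);   /* typically "Dvorak naive aether"; never
                        silently shorter, as with plain -c */
        iconv_close(cd);
        return 0;
    }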
And both glibc's and mine are very poor substitutes for what Links can
do: Links can even turn "Дебян" or "Δεβιαν" into "Debian". But that's
probably overkill here...

-- 
1KB
// Microsoft corollary to Hanlon's razor:
//      Never attribute to stupidity what can be
//      adequately explained by malice.