Re: Man pages and UTF-8

2007-09-16 Thread Colin Watson
On Fri, Sep 14, 2007 at 10:39:10AM +0100, Colin Watson wrote: > On Wed, Sep 12, 2007 at 02:25:26AM +0200, Adam Borowski wrote: > > On Tue, Sep 11, 2007 at 09:55:44AM +0100, Colin Watson wrote: > > > Is this what your "hack" pipeline implements? If so, I'd love to see it; > > > if not, I'm happy to

Re: Man pages and UTF-8

2007-09-14 Thread Colin Watson
On Wed, Sep 12, 2007 at 02:25:26AM +0200, Adam Borowski wrote: > On Tue, Sep 11, 2007 at 09:55:44AM +0100, Colin Watson wrote: > > > > I do need to find the stomach to look at upgrading groff again, but it's > > > > not *necessary* (or indeed sufficient) for this. The most important bit > > > > to

Re: Man pages and UTF-8

2007-09-11 Thread Adam Borowski
On Tue, Sep 11, 2007 at 09:55:44AM +0100, Colin Watson wrote: > > Woh, it's great to hear from you. I'm afraid I've been lazy too, you should > > be shown ready patches instead of hearing "that's mostly working"... > > If you do work on patches, please make sure they're against current bzr; > the

Re: Man pages and UTF-8

2007-09-11 Thread Colin Watson
[Quotes reordered slightly to suit the flow of my reply from argument to constructive suggestion. :-)] On Mon, Sep 10, 2007 at 09:56:50PM +0200, Adam Borowski wrote: > On Mon, Sep 10, 2007 at 07:03:57PM +0100, Colin Watson wrote: > > On Wed, Aug 15, 2007 at 12:50:53AM +0200, Adam Borowski wrote: >

Re: Man pages and UTF-8

2007-09-10 Thread Adam Borowski
On Mon, Sep 10, 2007 at 07:03:57PM +0100, Colin Watson wrote: > On Wed, Aug 15, 2007 at 12:50:53AM +0200, Adam Borowski wrote: > > (Colin, CC-ing you as I'm not sure if you're aware of this long thread, > > and both man-db and groff are your territory...) > > I wasn't aware of it, thanks. Sorry

Re: Man pages and UTF-8

2007-09-10 Thread Colin Watson
On Wed, Aug 15, 2007 at 12:50:53AM +0200, Adam Borowski wrote: > (Colin, CC-ing you as I'm not sure if you're aware of this long thread, > and both man-db and groff are your territory...) I wasn't aware of it, thanks. Sorry for my delay in responding. I read through the thread and there are a

Re: Man pages and UTF-8

2007-08-14 Thread David Given
Adam Borowski wrote: [...] > Due to Red Hat and probably other dists using UTF-8 already, plenty of man > pages are in UTF-8 when our groff still can't parse them. Having gone > through 2/3 of the archive, I got 807 such pages so far. And every singl

Re: Man pages and UTF-8

2007-08-14 Thread Adam Borowski
On Tue, Aug 14, 2007 at 04:13:17PM -0700, Russ Allbery wrote: > Adam Borowski <[EMAIL PROTECTED]> writes: > > > Any such description file would work only as long as you hard-code any > > fonts, and somehow provide them for any potential reader. Without this, > > wcwidth() is as good as you can ge

Re: Man pages and UTF-8

2007-08-14 Thread Russ Allbery
Adam Borowski <[EMAIL PROTECTED]> writes: > Any such description file would work only as long as you hard-code any > fonts, and somehow provide them for any potential reader. Without this, > wcwidth() is as good as you can get for fixed-width fonts. For > comparison, Red Hat makes a wild assumpt

Re: Man pages and UTF-8

2007-08-14 Thread Adam Borowski
(Colin, CC-ing you as I'm not sure if you're aware of this long thread, and both man-db and groff are your territory...) On Tue, Aug 14, 2007 at 05:25:27PM +0200, Nicolas François wrote: > I proposed Colin to work on it during Debconf, but still had no time to do > it. Could you tell us if an

Re: Man pages and UTF-8

2007-08-14 Thread Nicolas François
Hello, I proposed Colin to work on it during Debconf, but still had no time to do it. Interested people should read #196762. I tested a CVS snapshot of groff, and now it supports UTF-8 inputs (thanks to the preconv preprocessor) without patches. There is at least one remaining issue, which is th

Re: Man pages and UTF-8

2007-08-13 Thread Russ Allbery
David Given <[EMAIL PROTECTED]> writes: > Well... unfortunately man-db uses ISO-8859-1 for C and POSIX locales, > so transcoding would be required. You do get lintian warnings if you try to use ISO 8859-1 characters in man pages currently. Unfortunately, a lot of people just ignore those warn

Re: Man pages and UTF-8

2007-08-13 Thread David Given
Russ Allbery wrote: [...] > Okay, your analysis matches what I thought was going on. However, David > Given seems to be seeing something else where some man pages are already > encoded in UTF-8. So I guess I'm confused as to what's going on and what

Re: Man pages and UTF-8

2007-08-13 Thread Russ Allbery
Adam Borowski <[EMAIL PROTECTED]> writes: > The current Debian groff can produce UTF-8 output only for a narrow > range of characters, ones which happen to be present in 8 bit charsets. > It cannot handle UTF-8 input at all; on the other hand, Red Hat's > version seems to be working just fine. Yea

Re: Man pages and UTF-8

2007-08-13 Thread David Given
Ben Finney wrote: [...] >> The standard encoding for Japanese man pages is EUC-JP > > That's no more true than "the standard encoding for English text is > ASCII". The world is moving to Unicode encodings, though legacy > encodings will remain for som

Re: Man pages and UTF-8

2007-08-13 Thread Ben Finney
David Given <[EMAIL PROTECTED]> writes: > The standard encoding for Japanese man pages is EUC-JP That's no more true than "the standard encoding for English text is ASCII". The world is moving to Unicode encodings, though legacy encodings will remain for some time. They're also both equally irre

Re: Man pages and UTF-8

2007-08-13 Thread Adam Borowski
On Sun, Aug 12, 2007 at 08:44:13PM -0700, Russ Allbery wrote: > Adam Borowski <[EMAIL PROTECTED]> writes: > > > Issues to fix > > = > > > A. man output > > B. groff processing > > C. man input > > > Fixes for A. and B. are mostly local to "man-db", fixing C. would be a > > Debian-wid

Re: Man pages and UTF-8

2007-08-13 Thread David Given
Russ Allbery wrote: [...] > What I was trying to get at earlier is that I believe groff can't handle > UTF-8 input. So fixing B, if I'm correct, is certainly not local to > man-db. I believe that fixing groff to handle multibyte character sets > prop

Re: Man pages and UTF-8

2007-08-12 Thread Russ Allbery
Adam Borowski <[EMAIL PROTECTED]> writes: > Issues to fix > = > A. man output > B. groff processing > C. man input > Fixes for A. and B. are mostly local to "man-db", fixing C. would be a > Debian-wide issue. What I was trying to get at earlier is that I believe groff can't handle U

Re: Man pages and UTF-8

2007-08-12 Thread Adam Borowski
On Sun, Aug 12, 2007 at 08:12:24PM +0900, Osamu Aoki wrote: > On Sun, Aug 12, 2007 at 08:09:06PM +1000, Ben Finney wrote: > > There's an important difference between "beat the program with a large > > cluestick" and "beat the person with a large cluestick". Adam's > > assertion was only that the fo

Re: Man pages and UTF-8

2007-08-12 Thread Osamu Aoki
On Sun, Aug 12, 2007 at 08:09:06PM +1000, Ben Finney wrote: > Osamu Aoki <[EMAIL PROTECTED]> writes: > > > On Fri, Aug 10, 2007 at 01:23:02PM +0200, Adam Borowski wrote: > > > All data files should be in UTF-8 [...] you cannot inflict data > > > loss on others. If man-db does this, it needs to b

Re: Man pages and UTF-8

2007-08-12 Thread Ben Finney
Osamu Aoki <[EMAIL PROTECTED]> writes: > On Fri, Aug 10, 2007 at 01:23:02PM +0200, Adam Borowski wrote: > > All data files should be in UTF-8 [...] you cannot inflict data > > loss on others. If man-db does this, it needs to be beaten with a > > large cluestick. > > I think the maintainer of ma

Re: Man pages and UTF-8

2007-08-11 Thread Osamu Aoki
On Sat, Aug 11, 2007 at 05:08:53PM +1000, Ben Finney wrote: > Felipe Sateler <[EMAIL PROTECTED]> writes: > > > Ben Finney wrote: > > > > > (Assuming, of course, that I'm correct in saying lenny is supposed to > > > support UTF-8 throughout, and that failure to do so is a bug.) > > > > Can somebo

Re: Man pages and UTF-8

2007-08-11 Thread Osamu Aoki
Hi, On Fri, Aug 10, 2007 at 01:23:02PM +0200, Adam Borowski wrote: > On Fri, Aug 10, 2007 at 11:24:08AM +0100, David Given wrote: > > Ben Finney wrote: > > [...] > > > That sounds like a bug. I was under the impression that the default > > > encoding of everything in lenny was supposed to be UTF-8

Re: Man pages and UTF-8

2007-08-11 Thread David Given
Charles Plessy wrote: >[...] > You can refer to the following post of Noritada Kobayashi on this list: > http://lists.debian.org/debian-mentors/2007/03/msg00378.html > > Apparently, the encoding of the manpages is hardcoded, so you have no > other cho

Re: Man pages and UTF-8

2007-08-11 Thread Ben Finney
Felipe Sateler <[EMAIL PROTECTED]> writes: > Ben Finney wrote: > > > (Assuming, of course, that I'm correct in saying lenny is supposed to > > support UTF-8 throughout, and that failure to do so is a bug.) > > Can somebody else confirm this? A few days ago I saw > that /usr/share/dict/{wspanish,

Re: Man pages and UTF-8

2007-08-10 Thread Felipe Sateler
Ben Finney wrote: > (Assuming, of course, that I'm correct in saying lenny is supposed to > support UTF-8 throughout, and that failure to do so is a bug.) Can somebody else confirm this? A few days ago I saw that /usr/share/dict/{wspanish,wamerican} are ISO-8859 (which means they can't be cat'ted

Re: Man pages and UTF-8

2007-08-10 Thread Russ Allbery
Ben Finney <[EMAIL PROTECTED]> writes: > Okay. That doesn't mean it's not a bug; it simply points out where the > bug *is*. > (Assuming, of course, that I'm correct in saying lenny is supposed to > support UTF-8 throughout, and that failure to do so is a bug.) Yes, I think you're right. It's ju

Re: Man pages and UTF-8

2007-08-10 Thread Ben Finney
Russ Allbery <[EMAIL PROTECTED]> writes: > Ben Finney <[EMAIL PROTECTED]> writes: > > > That sounds like a bug. I was under the impression that the > > default encoding of everything in lenny was supposed to be UTF-8. > > The last time I checked, this didn't work for man pages. If it does > now

Re: Man pages and UTF-8

2007-08-10 Thread Charles Plessy
On Fri, Aug 10, 2007 at 10:54:27AM +0100, David Given wrote: > The package I'm putting together has no man page and the author is Japanese, > which means I have to write one; as a courtesy, I'd like to put the kanji-form > of his name in the A

Re: Man pages and UTF-8

2007-08-10 Thread Russ Allbery
David Given <[EMAIL PROTECTED]> writes: > What, then, should I be doing? Is it legitimate to include UTF-8 in my > man page and assume that it'll be fixed (some day)? This > seems... un-Debian-like. Is there an alternative way of representing > Unicode in troff that might work better? > Of cours

Re: Man pages and UTF-8

2007-08-10 Thread David Given
Russ Allbery wrote: [...] > The last time I checked, this didn't work for man pages. If it does now > and we can just install man pages in UTF-8, that's great, but a quick test > seems to indicate it still doesn't work even if you run groff -T utf8 in

Re: Man pages and UTF-8

2007-08-10 Thread Russ Allbery
Ben Finney <[EMAIL PROTECTED]> writes: > That sounds like a bug. I was under the impression that the default > encoding of everything in lenny was supposed to be UTF-8. The last time I checked, this didn't work for man pages. If it does now and we can just install man pages in UTF-8, that's grea

Re: Man pages and UTF-8

2007-08-10 Thread Adam Borowski
On Fri, Aug 10, 2007 at 11:24:08AM +0100, David Given wrote: > Ben Finney wrote: > [...] > > That sounds like a bug. I was under the impression that the default > > encoding of everything in lenny was supposed to be UTF-8. > > > > What tool is it that has this different default encoding? > > Well

Re: Man pages and UTF-8

2007-08-10 Thread David Given
Ben Finney wrote: [...] > That sounds like a bug. I was under the impression that the default > encoding of everything in lenny was supposed to be UTF-8. > > What tool is it that has this different default encoding? Well, I tried UTF-8 with the assum

Re: Man pages and UTF-8

2007-08-10 Thread Ben Finney
David Given <[EMAIL PROTECTED]> writes: > Unfortunately, it would appear that using kanji characters in a man > page is non-trivial as the encoding for English man pages seems to > default to ISO-8859-1. That sounds like a bug. I was under the impression that the default encoding of everything in