I have been following this list for some time, and I feel I might have
something to add here. Please do not hesitate to tell me to shut up if this
just makes things worse.

--------------------------

Here is a list of byte sequences that might represent spaces in an
otherwise ASCII-appearing document, according to the Wikipedia page on
non-breaking space:

0x26, 0x23, 0x31, 0x36, 0x30, 0x3b
0x26, 0x23, 0x78, 0x41, 0x30, 0x3b
0x26, 0x6e, 0x62, 0x73, 0x70, 0x3b
0x73
0x91
0xa0
0xc2, 0xa0
0xe2, 0x80, 0x87
0xe2, 0x80, 0xaf
0xe2, 0x81, 0xa0
0xff, 0xfe

(and of course 0x20 and perhaps 0x9 might also represent space...)
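
As a rough illustration (a sketch in Python, nothing to do with how mandoc
works), the entity forms and the multi-byte forms from that list can be
decoded to see which character each one stands for. The helper name
describe is made up for this example, and the single-byte entries are left
out because their meaning depends on which legacy encoding is assumed.

```python
import html
import unicodedata

# Byte sequences copied from the list above.
ENTITY_FORMS = [
    bytes([0x26, 0x23, 0x31, 0x36, 0x30, 0x3b]),
    bytes([0x26, 0x23, 0x78, 0x41, 0x30, 0x3b]),
    bytes([0x26, 0x6e, 0x62, 0x73, 0x70, 0x3b]),
]
UTF8_FORMS = [
    bytes([0xc2, 0xa0]),
    bytes([0xe2, 0x80, 0x87]),
    bytes([0xe2, 0x80, 0xaf]),
    bytes([0xe2, 0x81, 0xa0]),
]


def describe(raw):
    """Decode raw as UTF-8, expand HTML entities, and name each character."""
    text = html.unescape(raw.decode("utf-8"))
    return ["U+%04X %s" % (ord(ch), unicodedata.name(ch, "UNNAMED"))
            for ch in text]


if __name__ == "__main__":
    for raw in ENTITY_FORMS + UTF8_FORMS:
        print(" ".join("0x%02x" % b for b in raw), "->", describe(raw))
```

The three entity forms all come out as the same single character, which is
part of why there are so many ways for it to sneak into a document.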

The mechanisms by which these can appear are probably best left as an
exercise - there are many possible mechanisms.

Of those, the 0xc2, 0xa0 sequence is probably what you were referring to,
since that is the UTF-8 encoding of U+00A0, the Unicode "NO-BREAK SPACE".
However, it's probably worth noting that while 0xa0 on its own is a
non-breaking space in ISO/IEC 8859-1, in that encoding 0xc2 is a printable
character (capital A with circumflex), so the same two bytes read as that
letter followed by a non-breaking space.

Anyway, I guess the point here is that if you "automatically fix" some
documents you will "automatically break" some other documents.
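
To make that concrete, here is a small Python sketch under my own
assumptions: naive_fix is a hypothetical "automatic fix" that replaces the
bytes 0xc2 0xa0 with a plain space, and both sample documents are contrived.
It does the right thing for UTF-8 input and silently eats a letter from
ISO/IEC 8859-1 input.

```python
def naive_fix(data):
    """Hypothetical automatic fix: treat 0xc2 0xa0 as a UTF-8 no-break space."""
    return data.replace(b"\xc2\xa0", b" ")


# Contrived UTF-8 document: "-fd", a no-break space, then "N".
utf8_doc = "-fd\u00a0N".encode("utf-8")

# Contrived ISO/IEC 8859-1 document: capital A with circumflex, a no-break
# space, then "propos". On disk it starts with the same two bytes.
latin1_doc = "\u00c2\u00a0propos".encode("iso-8859-1")

fixed_utf8 = naive_fix(utf8_doc)
fixed_latin1 = naive_fix(latin1_doc)

if __name__ == "__main__":
    print(repr(fixed_utf8))
    print(repr(fixed_latin1))
```

Nothing in the bytes themselves says which of the two readings was
intended, so the fix has to guess.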

This is a frustrating situation, but a common one (the nice thing about
standards is that there are so many to choose from...). As a general rule,
the best solution is the one that combines simplicity of documentation with
simplicity of implementation, as well as tolerable failure modes. But
there's no one choice that's really great.

That said, personally I tend to favor a warning message for the case where
out-of-standard characters are encountered.
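
A minimal sketch of what I mean, in Python, and only as an assumption about
what such a check could look like: non_ascii_warnings is a made-up name, and
the message format just imitates the file:line:column style of the mandoc
output quoted below. It reports every byte above 0x7f and changes nothing.

```python
def non_ascii_warnings(name, data):
    """Return one warning string per non-ASCII byte, without modifying data."""
    warnings = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        for column, byte in enumerate(line, start=1):
            if byte > 0x7f:
                warnings.append(
                    "%s:%d:%d: WARNING: non-ASCII byte: 0x%02x"
                    % (name, lineno, column, byte)
                )
    return warnings


if __name__ == "__main__":
    # Made-up input: the second line contains the bytes 0xc2 0xa0.
    sample = b".Sh NAME\nfoo\xc2\xa0bar\n"
    for message in non_ascii_warnings("example.8", sample):
        print(message)
```

The person reading the warning usually knows the intended encoding, which
the tool cannot know.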

But I suppose it's reasonable to create a new standard. Perhaps "ASCII with
certain non-breaking spaces"? There's always room for more standards?

And, with that, I am going to go back to lurking.

-- 
Raul





On Fri, Oct 24, 2014 at 5:38 PM, Liviu Daia <[email protected]> wrote:

> On 24 October 2014, Ingo Schwarze <[email protected]> wrote:
> > Hi,
> >
> > Liviu Daia wrote on Fri, Oct 24, 2014 at 08:37:31AM +0300:
> > > On 24 October 2014, Ingo Schwarze <[email protected]> wrote:
> > >> Gleydson Soares wrote on Thu, Oct 23, 2014 at 09:11:36PM -0300:
> > >>> On Thu, Oct 23, 2014 at 10:36:44AM -0300, Gonzalo L. Rodriguez wrote:
> >
> > >>>> -USE_GROFF =             Yes
> >
> > >>> mandoc complains:
> > >>>
> > >>> $ mandoc -Tlint -Werror stunnel.8
> > >>> mandoc: stunnel.8:35:2: ERROR: skipping unknown macro: 'br\&
> > >>> mandoc: stunnel.8:85:37: ERROR: skipping bad character: 0xc2
> > >>> mandoc: stunnel.8:85:38: ERROR: skipping bad character: 0xa0
> > >>> mandoc: stunnel.8:1084:11: ERROR: skipping bad character: 0xc5
> > >>> mandoc: stunnel.8:1084:12: ERROR: skipping bad character: 0x82
> > >>> mandoc: stunnel.8:1085:16: ERROR: skipping bad character: 0xc5
> > >>> mandoc: stunnel.8:1085:17: ERROR: skipping bad character: 0x82
> > >>> $
> > >>>
> > >>> are you sure to zap groff?
> >
> > >> Yes, it's a perlpod(1) manual, and these particular errors are
> > >> harmless.
> > >>
> > >>  - 35:2 has no ill effect, actually, it's a bug in mandoc(1) that
> > >>         this bogus message is shown, i will look into fixing it.
> > >>  - 85:37-38 is merely a bug in the manual,
> > >>             two stray gibberish eight bit bytes
> >
> > >     Not really, 0xC2 0xA0 is Unicode "NO-BREAK SPACE":
> > >
> > >         http://www.fileformat.info/info/unicode/char/a0/index.htm
> > >
> > >     There are probably more of these around,
> >
> > No kidding.
> >
> > > various *roff tools produce them.
> >
> > Really?  Hopefully not.  If you run into tools doing that, please
> > do report them to me.  I am willing to hunt those bugs down and
> > talk to the upstream maintainers of such broken tools.
> >
> > In the case at hand, you can claim for sure that Russ Allbery's
> > pod2man(1) and David Wheeler's Pod::Simple are excessively complicated,
> > but they are not broken in this respect.  They produce correct
> > output by default.
>
>     Hello?  I was referring to non-breakable space.  I was just pointing
> out that you can expect non-breakable space characters to creep into
> man pages simply because a lot of manual pages these days are actually
> converted from other formats.
>
>     I never claimed anything about stunnel, pod2man, groff, generic
> UTF-8 characters in *roff, Heirloom, Solaris, plan9, or the translation
> of Zarathustra's collected works in Swahili. :) I just humbly pointed
> out that you'll probably stumble upon other non-breakable spaces, and
> that dealing with them (say by replacing them with normal spaces) might
> be a more energy- and time-efficient approach than posting a tirade
> against the authors of said man pages every time you run into that
> problem.
>
>     Regards,
>
>     Liviu Daia
>
> > The problem here is that the stunnel(8) maintainers don't know what
> > they are doing.  In Makefile.in, they pass the -u option (use UTF-8
> > in the generated roff(7) code) to pod2man(1), even though the manual
> > explicitly states "Many *roff implementations cannot handle non-ASCII
> > characters".  That is a massive understatement.  I do not know of
> > any implementation of roff(7) that can handle that.  Definitely
> > no version of groff or mandoc ever could, and the next future
> > releases of these two (groff-1.22.3 and mandoc-1.13.2) will not be
> > able to do it, either.  It is planned for mandoc, but work hasn't
> > started yet.  There are certainly no plans to support that in groff,
> > or i would have heard of it.  If you find *any* implementation of
> > roff(7) that can handle UTF-8 *input* without running a recoder
> > like preconv(1) first, i'd be glad to hear that.
> >
> > Now you might maybe argue that the stunnel(8) maintainers assume
> > everybody has preconv(1) available.  Strange assumption, as far as
> > i can tell, that's groff and mandoc only, and it works badly at
> > best for both of them.  And even if stunnel(8) exclusively targets
> > groff, it's not up to the job:
> >
> >    $ pod2man -u stunnel.pod | preconv -eutf8 | groff -mandoc -Tps \
> >        > stunnel.ps
> >   <standard input>:85: warning: can't find special character `u00A0'
> >
> >  ... and the resulting PostScript file has "-fdN" without a blank
> > in the SYNOPSIS line.
> >
> >    $ pod2man -u stunnel.pod | preconv -eutf8 | groff -mandoc -Tascii
> >    $ pod2man -u stunnel.pod | preconv -eutf8 | groff -mandoc -Tlatin1
> >
> > don't give you the blank, either, even though it's seemingly easy
> > enough to translate a blank to ASCII.
> >
> > By the way, even pod2man(1) itself is unable to properly handle
> > UTF-8 input.  If you do *not* give -u, there is no attempt to
> > encode non-ASCII characters into roff(7) escape sequences, they are
> > just replaced with "X" characters.  And i can't blame pod2man(1),
> > it's completely unclear what it should do.  If i remember correctly,
> > last time i looked, i found four different ways to write UTF-8
> > escape sequences in the following three roff(7) implementations:
> > groff, Heirloom/Solaris and plan9.  None of these escape syntaxes
> > worked for more than one implementation; groff has two alternative
> > syntaxes exhibiting a few very subtle, probably unintended differences
> > in the output produced.  Anything that exists is utterly non-portable.
> >
> > So the only sane way i can see for manuals of portable software is
> > to not use any kind of non-ASCII characters, but instead do ASCII
> > transliterations for author names by hand when writing the manuals,
> > and most importantly *never* use pod2man(1) -u because that breaks
> > more than just UTF-8 characters.  It also breaks spacing.
> >
> > Yes, this is a mess, and at some point, i need to attack this maze
> > of problems.  But it is complex.  Cleaning up errno handling in
> > src/lib/libc/rpc and src/lib/libc/yp is a simpler task.
> >
> > Yours,
> >   Ingo
>
>
