Re: 1.23 prints some strange error

G. Branden Robinson Wed, 25 Oct 2023 03:04:16 -0700

Hi Walter & Dave,

At 2023-09-11T19:45:30+0200, Walter Alejandro Iglesias wrote:
> If instead of sourcing hyphen.tr from my macros with .mso I source it
> directly from the roff document with .so those error messages
> desapear.

As Dave mentioned, this is explained by soelim(1) not being run on the
"macro sourced" file.  As a rule, I think files to be read with the
`mso` request should be in plain ASCII only.  The whole point of a macro
file suitable for general use is that it...gets used generally, which
means that documents employing a variety of input encodings might employ
it.  You therefore should use the lowest common denominator character
encoding for it: ASCII.  (Strictly, ISO 646:1991-IRV.)
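
One way to check that a macro file really is plain ASCII before you share it is to count the bytes that have the high bit set.  The following is only a sketch; the helper name count_non_ascii is made up for illustration and is not part of groff.

```shell
# Count the bytes in a file that lie outside the 7-bit ASCII range.
# Prints 0 when the file is pure ASCII.
# (count_non_ascii is a hypothetical helper name, not a groff tool.)
count_non_ascii() {
    # Delete every byte from octal 000 to 177, then count what is left.
    LC_ALL=C tr -d '\000-\177' < "$1" | wc -c | tr -d ' '
}
```

Running `count_non_ascii family.mso` (any macro file name will do) should print 0 for a file that is safe to `mso` from documents in any input encoding.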

That doesn't mean you have to do much more work or spend a lot of time
staring at groff_char(7) and learning the special character identifiers
for the upper half of ISO 8859-1.  You can still write your macro file
in Latin-1; just run preconv over it stand-alone as a converter.

$ printf '.ds aunt la t\355a\n' > family.mso.in
$ preconv -e latin1 family.mso.in > family.mso

Part of the preconv(1) man page is likely worth reviewing.

   iconv support
[...]
       The use of iconv means that characters in the input that encode
       invalid code points for that encoding may be dropped from the
       output stream or mapped to the Unicode replacement character
       (U+FFFD).  Compare the following examples using the input “café”
       (note the “e” with an acute accent), which due to its short
       length challenges inference of the encoding used.
              printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
              printf 'caf\351\n' | preconv -e us-ascii
              printf 'caf\351\n' | preconv -e latin-1
       The fate of the accented “e” differs in each case.  In the first,
       uchardet fails to detect an encoding (though the library on your
       system may behave differently) and preconv falls back to the
       locale settings, where octal 351 starts an incomplete UTF‐8
       sequence and results in the Unicode replacement character.  In
       the second, it is not a representable character in the declared
       input encoding of US‐ASCII and is discarded by iconv.  In the
       last, it is correctly detected and mapped.
[...]
   Limitations
       preconv cannot perform any transformation on input that it cannot
       see.  Examples include files that are interpolated by
       preprocessors that run subsequently, including soelim(1); files
       included by troff itself through “so” and similar requests; and
       string definitions passed to troff through its -d command‐line
       option.

Maybe I should add my admonition above about macro-sourced files to this
man page.

At 2023-09-12T11:16:58+0200, Walter Alejandro Iglesias wrote:
> I cleaned up a bit the quoted text to make room for the following.  Here
> we go:
> 
>   $ uname -a
>   Linux bell 6.4.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.4.13-1 (2023-08-31) x86_64 GNU/Linux
>   $ groff --version | head -1
>   GNU groff version 1.23.0
>   $ mkdir test
>   $ cd test
>   $ cat << EOF > doc.tr
>   .mso list.tr
>   EOF
>   $ cat << EOF > list.tr
>   .hw a-hí
>   .hw a-ño
>   .hw ár-bol
>   .hw cu-brí-a
>   .hw e-té-re-o
>   .hw ca-mión
>   .hw ú-te-ro
>   .hw pin-güi-no
>   EOF
>   $ GROFF_TMAC_PATH=. nroff doc.tr
>   troff:./list.tr:1: error: expected ordinary or special character, got an escaped '%'
>   troff:./list.tr:4: error: expected ordinary or special character, got an escaped '%'

This transcript isn't as useful as it could be, because it doesn't
reveal which character encoding list.tr uses on the file system.
Running the file(1) command on it and sharing the output would help.
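
If file(1) isn't handy, or you doubt its guess, dumping the bytes answers the same question.  Here is a small sketch using od(1); the helper name hexbytes is my own invention, and the sample input is just the two encodings of "hí".

```shell
# Print the bytes of standard input as hex pairs on a single line.
# (hexbytes is a hypothetical helper name.)
hexbytes() {
    od -An -tx1 | xargs echo
}

printf 'h\303\255\n' | hexbytes   # UTF-8 "hí":   68 c3 ad 0a
printf 'h\355\n'     | hexbytes   # Latin-1 "hí": 68 ed 0a
```

An i-acute stored as the two bytes c3 ad means the file is UTF-8; a lone ed byte means Latin-1 (or a close relative).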

> As you see, from the UTF-8 chars used in Spanish (á, é, í, ó, ú, ü,
> ñ), groff seems to only have problems with the 'í' in particular.
> Let's try another test using preconv(1).

preconv is probably using iconv(3) on your system ("preconv --version"
will tell you).  iconv's heuristics for guessing the encoding are opaque
to groff (and to me).

> The errors remain.  Finally, I told you that changing .mso request to
> .so made the error messages disappear, that's because in my Makefile I
> run soelim(1) before.  Last test:
> 
>   $ cat << EOF > doc.tr
>   .hla es
>   .so list.tr         \" notice here I changed the request
>   Ahí, el árbol nos cubría con su sombra.
>   Un pingüino pasaba caminando por la playa.
>   EOF
>   $ preconv -e UTF-8 doc.tr | nroff | cat -s
>   troff:./list.tr:1: error: expected ordinary or special character, got an escaped '%'
>   troff:./list.tr:3: error: expected ordinary or special character, got an escaped '%'
>   Ahí, el árbol nos cubría con su sombra.  Un pingüino pasaba cami‐
>   nando por la playa.
>   $ soelim doc.tr | preconv -e UTF-8 | nroff | cat -s
>   Ahí, el árbol nos cubría con su sombra.  Un pingüino pasaba cami‐
>   nando por la playa.
> 
> This last command throws no error, that's because soelim(1) allows
> preconv(1) to process the list.tr file.

Right; I think that's precisely the right strategy.  You can maintain
the file you want to `mso` in version control in whatever character
encoding is comfortable for you--I'd store it as an ".in" file and have
make(1) run preconv(1) over it when constructing documents that use it.
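
As a rough sketch of that build step, written as a shell function rather than a make(1) rule: it assumes the sources carry a ".in" suffix as in the family.mso.in example above and that they are all in one encoding.  The function name build_macros and the ENCODING variable are hypothetical; only `preconv -e` comes from groff.

```shell
# Convert every *.in macro source in a directory to an ASCII-only file
# that is safe to load with the mso request.
# Assumption: all sources share one encoding (Latin-1 by default here);
# set ENCODING to whatever your files actually use.
ENCODING=${ENCODING:-latin1}

build_macros() {
    dir=${1:-.}
    for src in "$dir"/*.in; do
        [ -e "$src" ] || continue      # no matches: skip the literal glob
        out=${src%.in}                 # family.mso.in -> family.mso
        preconv -e "$ENCODING" "$src" > "$out" || return 1
    done
}
```

Calling `build_macros .` before formatting regenerates the converted files; only the ".in" sources need to live in version control.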

> Anyways.  My doubt comes from the fact that so far (with groff 1.22.4
> under OpenBSD) I haven't needed to preprocess that .hw list with
> preconv,

OpenBSD is notoriously minimalistic.  You might see if `preconv
--version` there reports use of iconv...except...uh, I think revealing
that information is something I added _after_ the groff 1.22.4 release.

So here's another paragraph from preconv(1) that might explain the
behavior on OpenBSD.

   iconv support
       While preconv recognizes all of the coding tags listed above, it
       is capable on its own of interpreting only three encodings:
       Latin‐1, code page 1047, and UTF‐8.  If iconv support is
       configured at compile time and available at run time, all others
       are passed to iconv library functions, which may recognize many
       additional encoding strings.  The command “preconv -v” discloses
       whether iconv support is configured.

Unfortunately I don't know of an example of an encoding name that is a
reliable test for iconv support being absent.

> and that only the 'í' (iacute) triggers the error.

I think this might be explained by iconv(3)'s heuristic approach.

On my system, I confirmed that nothing crazy was going on with the
following experiments.

$ printf 'caf\351\n' | preconv -e latin1
.lf 1 -
caf\[u00E9]
$ printf 'la t\355a\n' | preconv -e latin1 | nroff | head -n 1
la tía
$ printf 'la t\355a\n' | nroff -K latin1 | head -n 1
la tía
$ printf 'la t\355a\n' | nroff | head -n 1
la tía

At 2023-10-05T10:45:32+0200, Walter Alejandro Iglesias wrote:
> If I feed preconv with a file already in latin1 (using UTF-8 locales
> here) ...
> 
>   $ preconv -e utf8 list_in_latin1.tr
> 
> ... *all* non ASCII characters in the output are replaced by \[uFFFD].

Yes, because the `-e` flag _describes the character encoding of the
input_.

Description
       preconv reads each file, converts its encoded characters to a
       form troff(1) can interpret, and sends the result to the standard
       output stream.
[...]
Options
[...]
       -e encoding
              Skip detection and assume encoding; see groff’s -K option.

Do not try to tell preconv the desired character encoding of the
_output_; that's not its job.  Its job is to normalize the input so that
GNU troff(1) can read it.

The character encoding of the output is inapplicable to GNU troff(1)
itself; it, like all device-independent troffs, writes an ASCII-encoded
plain text file.  An output driver like grotty(1) translates troff(1)
output into whatever is appropriate for the device, which is why groff's
terminal output devices are named things like "ascii", "latin1" and
"utf8".

At 2023-10-12T16:46:07-0500, Dave Kemper wrote:
> On 10/5/23, Walter Alejandro Iglesias <w...@roquesor.com> wrote:
> > If I feed preconv with a file already in latin1 (using UTF-8 locales
> > here) ...
> >
> >   $ preconv -e utf8 list_in_latin1.tr
> >
> > ... *all* non ASCII characters in the output are replaced by \[uFFFD].
> 
> Yes, this would be expected to not work.  preconv's "-e" option
> specifies the *input* encoding.  So if the input file is in Latin-1,
> but you tell preconv that it's in UTF-8, you'd expect things to go
> awry.

Right.

> But that's not the full explanation: *all* Latin-1 characters are
> multiple bytes when encoded as UTF-8.

Strictly, Latin-1 is an 8-bit character encoding.  You might say here
"all characters from the Unicode Latin-1 extension block" instead.

Ya know, if you're a stickler.

> So if iacute (Latin-1 0xED) is misread in the way Bjarni describes,
> the same should happen to all the other Latin-1 characters as well.
> The fact groff is treating one Latin-1 character differently from the
> others carries the whiff of a bug.

I'm prepared to chalk this up to iconv heuristic conversion in the
absence of other information.  See my attempted reproducers above.

Regards,
Branden
