Hi Walter & Dave,

At 2023-09-11T19:45:30+0200, Walter Alejandro Iglesias wrote:
> If instead of sourcing hyphen.tr from my macros with .mso I source it
> directly from the roff document with .so those error messages
> desapear.

As Dave mentioned, this is explained by soelim(1) not being run on the
"macro sourced" file.

As a rule, I think files to be read with the `mso` request should be in
plain ASCII only. The whole point of a macro file suitable for general
use is that it...gets used generally, which means that documents
employing a variety of input encodings might employ it. You therefore
should use the lowest common denominator character encoding for it:
ASCII. (Strictly, ISO 646:1991-IRV.)

That doesn't mean you have to do much more work or spend a lot of time
staring at groff_char(7) and learning the special character identifiers
for the upper half of ISO 8859-1. You can still have your macro sourced
file in Latin-1; just run preconv over it stand-alone as a converter.

$ printf '.ds aunt la t\355a\n' > family.mso.in
$ preconv -e latin1 family.mso.in > family.mso

Part of the preconv(1) man page is likely worth reviewing.

   iconv support
   [...]
     The use of iconv means that characters in the input that encode
     invalid code points for that encoding may be dropped from the
     output stream or mapped to the Unicode replacement character
     (U+FFFD). Compare the following examples using the input “café”
     (note the “e” with an acute accent), which due to its short length
     challenges inference of the encoding used.

          printf 'caf\351\n' | LC_ALL=en_US.UTF-8 preconv
          printf 'caf\351\n' | preconv -e us-ascii
          printf 'caf\351\n' | preconv -e latin-1

     The fate of the accented “e” differs in each case. In the first,
     uchardet fails to detect an encoding (though the library on your
     system may behave differently) and preconv falls back to the
     locale settings, where octal 351 starts an incomplete UTF‐8
     sequence and results in the Unicode replacement character. In the
     second, it is not a representable character in the declared input
     encoding of US‐ASCII and is discarded by iconv. In the last, it is
     correctly detected and mapped.
[...]
   Limitations
     preconv cannot perform any transformation on input that it cannot
     see.
     Examples include files that are interpolated by preprocessors
     that run subsequently, including soelim(1); files included by
     troff itself through “so” and similar requests; and string
     definitions passed to troff through its -d command‐line option.

Maybe I should add my admonition above about macro-sourced files to this
man page.

At 2023-09-12T11:16:58+0200, Walter Alejandro Iglesias wrote:
> I cleaned up a bit the quoted text to make room for the following. Here
> we go:
>
> $ uname -a
> Linux bell 6.4.0-4-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.4.13-1
> (2023-08-31) x86_64 GNU/Linux
> $ groff --version | head -1
> GNU groff version 1.23.0
> $ mkdir test
> $ cd test
> $ cat << EOF > doc.tr
> .mso list.tr
> EOF
> $ cat << EOF > list.tr
> .hw a-hí
> .hw a-ño
> .hw ár-bol
> .hw cu-brí-a
> .hw e-té-re-o
> .hw ca-mión
> .hw ú-te-ro
> .hw pin-güi-no
> EOF
> $ GROFF_TMAC_PATH=. nroff doc.tr
> troff:./list.tr:1: error: expected ordinary or special character, got an
> escaped '%'
> troff:./list.tr:4: error: expected ordinary or special character, got an
> escaped '%'

This transcript isn't as useful as it could be, because it didn't
disclose to me what character encoding was used for list.tr on the file
system. Running the file(1) command on it and sharing that would help.

> As you see, from the UTF-8 chars used in Spanish (á, é, í, ó, ú, ü,
> ñ), groff seems to only have problems with the 'í' in particular.
> Let's try another test using preconv(1).

preconv is probably using iconv(3) on your system ("preconv --version"
will tell you). iconv's heuristics for guessing the encoding are opaque
to groff (and to me).

> The errors remain. Finally, I told you that changing .mso request to
> .so made the error messages disappear, that's because in my Makefile I
> run soelim(1) before. Last test:
>
> $ cat << EOF > doc.tr
> .hla es
> .so list.tr \" notice here I changed the request
> Ahí, el árbol nos cubría con su sombra.
> Un pingüino pasaba caminando por la playa.
> EOF
> $ preconv -e UTF-8 doc.tr | nroff | cat -s
> troff:./list.tr:1: error: expected ordinary or special character, got an
> escaped '%'
> troff:./list.tr:3: error: expected ordinary or special character, got an
> escaped '%'
> Ahí, el árbol nos cubría con su sombra. Un pingüino pasaba cami‐
> nando por la playa.
> $ soelim doc.tr | preconv -e UTF-8 | nroff | cat -s
> Ahí, el árbol nos cubría con su sombra. Un pingüino pasaba cami‐
> nando por la playa.
>
> This last command throws no error, that's because soelim(1) allows
> preconv(1) to process the list.tr file.

Right, I think that's the right strategy precisely. You can maintain
the file you want to `mso` in version control in whatever character
encoding is comfortable for you--I'd store it as an ".in" file and have
make(1) run preconv(1) over it when constructing documents that use it.

> Anyways. My doubt comes from the fact that so far (with groff 1.22.4
> under OpenBSD) I haven't needed to preprocess that .hw list with
> preconv,

OpenBSD is notoriously minimalistic. You might see if
`preconv --version` there reports use of iconv...except...uh, I think
revealing that information is something I added _after_ the groff 1.22.4
release. So here's another paragraph from preconv(1) that might explain
the behavior on OpenBSD.

   iconv support
     While preconv recognizes all of the coding tags listed above, it
     is capable on its own of interpreting only three encodings:
     Latin‐1, code page 1047, and UTF‐8. If iconv support is configured
     at compile time and available at run time, all others are passed
     to iconv library functions, which may recognize many additional
     encoding strings. The command “preconv -v” discloses whether iconv
     support is configured.

Unfortunately I don't know of an example of an encoding name that is a
reliable test for iconv support being absent.

> and that only the 'í' (iacute) triggers the error.

I think this might be explained by iconv(3)'s heuristic approach.
On my system, I confirmed that nothing crazy was going on with the
following experiments.

$ printf 'caf\351\n' | preconv -e latin1
.lf 1 -
caf\[u00E9]
$ printf 'la t\355a\n' | preconv -e latin1 | nroff | head -n 1
la tía
$ printf 'la t\355a\n' | nroff -K latin1 | head -n 1
la tía
$ printf 'la t\355a\n' | nroff | head -n 1
la tía

At 2023-10-05T10:45:32+0200, Walter Alejandro Iglesias wrote:
> If I feed preconv with a file already in latin1 (using UTF-8 locales
> here) ...
>
> $ preconv -e utf8 list_in_latin1.tr
>
> ... *all* non ASCII characters in the output are replaced by \[uFFFD].

Yes, because the `-e` flag _describes the character encoding of the
input_.

   Description
     preconv reads each file, converts its encoded characters to a form
     troff(1) can interpret, and sends the result to the standard
     output stream.
[...]
   Options
[...]
     -e encoding
          Skip detection and assume encoding; see groff’s -K option.

Do not try to tell preconv the desired character encoding of the
_output_; that's not its job. Its job is to normalize the input so that
GNU troff(1) can read it.

The character encoding of the output is inapplicable to GNU troff(1)
itself; it, like all device-independent troffs, writes an ASCII-encoded
plain text file. An output driver like grotty(1) translates troff(1)
output into whatever is appropriate for the device, which is why groff's
terminal output devices are named things like "ascii", "latin1" and
"utf8".

At 2023-10-12T16:46:07-0500, Dave Kemper wrote:
> On 10/5/23, Walter Alejandro Iglesias <w...@roquesor.com> wrote:
> > If I feed preconv with a file already in latin1 (using UTF-8 locales
> > here) ...
> >
> > $ preconv -e utf8 list_in_latin1.tr
> >
> > ... *all* non ASCII characters in the output are replaced by \[uFFFD].
>
> Yes, this would be expected to not work. preconv's "-e" option
> specifies the *input* encoding. So if the input file is in Latin-1,
> but you tell preconv that it's in UTF-8, you'd expect things to go
> awry.

Right.

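To see why a mislabeled input goes wrong without involving groff at all,
here is a minimal sketch. It assumes a POSIX shell and an iconv(1)
command that accepts the encoding names ISO-8859-1 and UTF-8 (check
"iconv -l" on your system). The helper name `decodes_as` is made up for
this illustration; it is not part of groff or iconv.

```shell
#!/bin/sh
# decodes_as ENCODING: succeed if standard input is valid text in
# ENCODING.  Hypothetical helper; it only asks iconv(1) to decode the
# input and throws the converted text away.
decodes_as() {
    iconv -f "$1" -t UTF-8 >/dev/null 2>&1
}

# "la tia" with the i-acute stored as the single Latin-1 byte 0xED
# (octal 355), the same input used in the experiments above.
if printf 'la t\355a\n' | decodes_as ISO-8859-1; then
    echo "valid as ISO-8859-1"
fi

# In UTF-8, byte 0xED announces a three-byte sequence, but the next
# byte is a plain "a", so the decoder rejects the input.
if ! printf 'la t\355a\n' | decodes_as UTF-8; then
    echo "not valid as UTF-8"
fi
```

The same bytes are fine under one label and malformed under the other,
which is all `-e` changes: it names the encoding the input is already
in.
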
> But that's not the full explanation: *all* Latin-1 characters are
> multiple bytes when encoded as UTF-8.

Strictly, Latin-1 is an 8-bit character encoding. You might say here
"all characters from the Unicode Latin-1 Supplement block" instead. Ya
know, if you're a stickler.

> So if iacute (Latin-1 0xED) is misread in the way Bjarni describes,
> the same should happen to all the other Latin-1 characters as well.
> The fact groff is treating one Latin-1 character differently from the
> others carries the whiff of a bug.

I'm prepared to chalk this up to iconv heuristic conversion in the
absence of other information. See my attempted reproducers above.

Regards,
Branden
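
P.S. One possible guard for the "plain ASCII only" rule for files read
with `mso`, as a minimal sketch rather than anything groff ships. It
assumes a POSIX shell and tr(1); the function name `is_ascii` is made up
for this illustration.

```shell
#!/bin/sh
# is_ascii: succeed if standard input contains no byte above 0x7F
# (octal 177).  Hypothetical helper, not part of groff.
is_ascii() {
    # Delete every 7-bit byte; anything left over is non-ASCII.
    [ -z "$(LC_ALL=C tr -d '\000-\177')" ]
}

# A string definition spelled with a groff special character escape
# passes the check.
printf '.ds aunt la t\\[u00ED]a\n' | is_ascii && echo "ASCII only"

# The same definition with a raw Latin-1 byte (octal 355) does not.
printf '.ds aunt la t\355a\n' | is_ascii || echo "contains non-ASCII bytes"
```

Something like "is_ascii < family.mso" could run in a Makefile after the
preconv step, to catch a macro file that slipped through unconverted.
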