ripping out EBCDIC (cp1047)/preparing for UTF-8 input

G. Branden Robinson Tue, 14 May 2024 06:54:03 -0700

Hi folks,

I've started down a road long contemplated.


https://savannah.gnu.org/bugs/index.php?65724

Per discussion with Mike Fulton of IBM over a year ago, and hearing no
contradiction from anyone in the interim, I aim to drop EBCDIC a.k.a.
code page (CCSID) 1047 support from groff 1.24.

I've changed the default startup files in groff Git to no longer load
_either_ cp1047.tmac _or_ latin1.tmac.

The localization files (fr.tmac, de.tmac, etc.) that require support for
ISO Latin-X (or KOI8-R) code points for now continue to load the
appropriate macro files ({latin[1259],koi8-r}.tmac).  But those will
probably go away sooner or later.

The idea is, for 1.24, to get everybody migrating to documents that are
pure ASCII (as preconv(1) might generate) by the time GNU troff sees
them.  Recall that preconv is a preprocessor, and groff(1) has dedicated
flags to make it run, so people can still _maintain their source
documents_ in ISO Latin-X or KOI8-R.
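
A minimal sketch of that workflow, assuming preconv(1) and groff(1) are
installed and on $PATH.  The scratch directory and file names are
illustrative only.  preconv's -e option names the input encoding
explicitly; groff's -K option passes an encoding to preconv and implies
-k, the flag that makes groff run preconv at all.

```shell
# Keep the source in Latin-1; hand GNU troff ASCII.
# The scratch directory and file names are illustrative only.
work=$(mktemp -d)
printf 'You are painfully na\357ve.\n' >"$work/naive.latin1.txt"

# Run the preprocessor by hand.  -e names the input encoding explicitly.
# The result is ASCII, with the 8-bit character rewritten as a groff
# special character escape sequence.
if command -v preconv >/dev/null 2>&1; then
    preconv -e ISO-8859-1 "$work/naive.latin1.txt" \
        >"$work/naive.preconv.txt"
    cat "$work/naive.preconv.txt"
fi

# Or let groff run preconv itself: -K names the encoding and implies -k.
if command -v groff >/dev/null 2>&1; then
    groff -K ISO-8859-1 -T utf8 "$work/naive.latin1.txt"
fi
```

Either way the file on disk stays in Latin-1; only what GNU troff reads
changes.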

But somebody who has been composing in English and mostly Basic Latin
with the occasional Latin-1 character sprinkled in will stop getting the
output they expect.

$ printf 'You are painfully na\357ve.\n' \
  | ~/groff-stable/bin/troff -z 2>&1 | grep . || echo NO OUTPUT
NO OUTPUT
$ printf 'You are painfully na\357ve.\n' \
  | ~/groff-HEAD/bin/troff -z 2>&1 | grep . || echo NO OUTPUT
/.../groff-HEAD/bin/troff:<standard input>:1: warning: character with input code 239 not defined

One way to check one's input documents to see if they'll have trouble is
to run file(1) on them.

$ printf 'You are painfully na\357ve.\n' >naive.latin1.txt
$ printf 'You are painfully na\\[i ad]ve.\n' >naive.ascii.txt
$ file naive.*
naive.ascii.txt:  ASCII text
naive.latin1.txt: ISO-8859 text

"ISO-8859 text" will be contraindicated for groff 1.24.
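
file(1) classifies by heuristic, so as a cross-check one can count the
bytes outside the 7-bit range directly.  The following is a minimal
sketch using only POSIX tr(1) and wc(1); the function name
report_non_ascii is hypothetical and not part of groff.

```shell
# Hypothetical helper: for each file named, report how many bytes fall
# outside the 7-bit ASCII range (octal 000 through 177).
report_non_ascii() {
    for f in "$@"; do
        # LC_ALL=C makes tr work on bytes; deleting every ASCII byte
        # leaves only the 8-bit ones for wc to count.  The arithmetic
        # expansion strips any padding wc adds around the number.
        n=$(($(LC_ALL=C tr -d '\000-\177' <"$f" | wc -c)))
        if [ "$n" -gt 0 ]; then
            printf '%s: %d non-ASCII byte(s)\n' "$f" "$n"
        fi
    done
}

# Check whatever files were given on the command line, if any.
report_non_ascii "$@"
```

Run against the two naive.* files created above, this names only
naive.latin1.txt, with a count of 1.  Silence means the input is
already pure ASCII.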

This achieved, we can further modify GNU troff to accept and expect
UTF-8 input directly for groff 1.25.

And that's, like, in the Mission Statement or something, which is now 10
years old.

The foregoing will require a NEWS item I haven't written yet.  I expect
I won't know everything it needs to say until I've finished the
outripping.

If someone strongly objects, please speak up soon, with a viable
alternative path to GNU troff's recognition of UTF-8, before I do more
radical code deletion.

Regards,
Branden
