[let me know if you're subscribed to the list or if you'd prefer not to be CCed]
[also, if you want to break any of the several subjects arising in this
message into a separate thread, please feel free]

Hi Mike,

At 2023-03-31T07:29:16-0700, Mike Fulton wrote:
> Over the last year, we have been working hard in the z/OS Open Tools
> community (https://zosopentools.github.io/meta/#/) to not only port
> the fundamental tools to z/OS, but also to do it completely in the
> open.

This is good news! Knowing that you're a software developer might also
make communications easier. :)

> We create one 'port' repo for each Open Source package and the repo
> contains information on compiler options, dependencies, and so forth
> so that anyone can (relatively easily) build the software.
> We also have a special repo (meta) that has a rudimentary package
> manager and build tool that we use (e.g. _zopen install_ to install
> binaries, _zopen build_ to build from source, etc.).

Much as with GNU/Linux distributions; this is a pleasure to hear.

As a groff developer, I'm interested in minimizing the number of patches
you have to carry "downstream" to support groff.

I assume the change here:

https://github.com/ZOSOpenTools/groffport/blob/main/patches/makevarescape.sed.patch

is due to a limitation of the system's sed(1)?

If the problem is the '\+' part of the pattern, I see that POSIX says
that the interpretation of that is "implementation-defined", though the
latest draft of Issue 8 (just out in the past 24 hours or so) says that
"a future version of this standard may require "\?", "\+", and "\|" to
behave as described for the ERE special characters '?', '+', and '|',
respectively." (IEEE P1003.1™-202x/D3, March 2023, p. 181).

A workaround would be:

-s|[^ ]/\+|&\\\\:|g
+s|[^ ]//*|&\\\\:|g

If you also want to steal a slight improvement from groff 1.23, you can
do this instead:

-s|[^ ]/\+|&\\\\:|g
+s|[^ ]//*|&\\\\:\\\\%|g

> We have indeed moved to a 'UTF-8 first' model, which for the most part
> is a 'ISO8859-1 first' model

Interestingly, this meshes closely with groff's assumptions. Due to its
chronological origins ca. 1990, it does not accept UTF-8 input, but it
is aware of UTF-8 and can produce it as output.

The formatter, troff(1), accepts ISO Latin-1 input, except on systems
where the C preprocessor macro "IS_EBCDIC_HOST" evaluates true; it then
assumes that its input is encoded using code page 1047. I reckon you've
already dealt with this if necessary, and ensured that your groff 1.22.4
build does not define that symbol.

Is code page 1047 deprecated or obsolescent on z/OS? If groff dropped
support for it, do you suspect any z/OS users would be inconvenienced?

> and we have a special OS library that takes care of edge case
> conversions to EBCDIC (and provides a couple functions that are
> missing). This is also Open Source (zoslib).

This is really good stuff to hear about; thanks for bringing this
initiative to my attention.

> We have about 80 packages we are porting / have ported. Some are very
> far along like gnu make and Perl with many fixes upstreamed. Some are
> just barely building - htop is probably a good example of one we have
> just started on.

I'm glad groff is a member of the first 100! :D

> I am also not sure if we want to work in UTF-8 or in ISO-8859-1. My
> goal would be UTF-8 across the board, but I expect there are things we
> still need to fix to get there. Our vim port seems to work well with
> UTF-8 but I'll be honest that the testing of that is sparse still.

My suggestion would be to back the UTF-8 horse. groff already has
machinery in place for accommodating input in UTF-8 via the preconv(1)
preprocessor.
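
Coming back to the sed(1) workaround above: here is a quick way to
confirm that the '//*' form behaves as intended with whatever sed you
have. This is only a sketch; the sample path is made up for
illustration, and I am assuming the expression is used exactly as it
appears in the patch.

```shell
# Hypothetical sample input: a path with single and doubled slashes.
input='usr/share//doc/groff'

# "//*" is a plain BRE: one slash followed by zero or more slashes.
# It matches the same runs of slashes as "/\+" does in seds that
# support that extension, without relying on the extension.
out=$(printf '%s\n' "$input" | sed 's|[^ ]//*|&\\\\:|g')

printf '%s\n' "$out"
```

Each run of slashes that follows a non-space character should come out
followed by two backslashes and a colon.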

If there is no longer an audience for code page 1047, several aspects of
groff could be simplified, and it might make the transition of GNU
troff's internal type to int32_t easier. (I started down this road once
before.)

> With all that background, I'm wondering if 'both' is the right answer?

I don't feel qualified to answer this question in general; for groff,
it's a pickle because the original implementer (James Clark) used many
C0 and C1 control code points for internal purposes, to encode "node
types" that could be encountered internally by the formatter when
processing diversions (a Unix nroff/troff feature that usually only
authors of macro packages mess with). You can see these assignments in
the "input.h" header file.

https://git.savannah.gnu.org/cgit/groff.git/tree/src/roff/troff/input.h

Use of these codes for internal purposes isn't necessarily incompatible
with UTF-8 input; GNU troff already rejects them upon input, and almost
none of them are meaningful for a "plain text" document that is going to
achieve format control mostly via roff language features rather than
control characters. Input processing could be made more sophisticated
(and more stateful when reading the input byte stream to keep track of
UTF-8 sequences).

> Would others also find it valuable to be able to have the mathematical
> angle brackets in UTF-8 be transliterated to angle brackets in
> ISO8859-1?

Unless you mean degradation to basic Latin less than and greater than
signs, U+003C and U+003E, then I don't think there are any valid
transliteration targets in ISO Latin-1. The "left-" and "right-pointing
double angle quotation mark"s (U+00AB and U+00BB) are indeed visually
similar but semantically pretty distinct. I don't think I'd want to
impose such a fallback in general. (There are multiple ways groff users
could provide fallbacks for themselves.)
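
To illustrate the "degradation" case only: a user who wants U+003C and
U+003E in place of the mathematical angle brackets could filter the
formatted output themselves. The sketch below assumes the brackets in
question are U+27E8 and U+27E9, and it runs outside groff entirely; it
is not a groff feature, and the sample text is made up.

```shell
# U+27E8 and U+27E9 as UTF-8, written with octal escapes so that this
# script itself stays plain ASCII.
la=$(printf '\342\237\250')
ra=$(printf '\342\237\251')

# Hypothetical sample of formatted output containing the brackets.
sample="GNU ${la}http://www.gnu.org${ra}."

# Degrade each bracket to its basic Latin look-alike. LC_ALL=C makes
# sed match the byte sequences regardless of the caller's locale.
out=$(printf '%s\n' "$sample" | LC_ALL=C sed "s/$la/</g; s/$ra/>/g")

printf '%s\n' "$out"
```

The result is representable in ISO Latin-1 (indeed in ASCII), at the
cost of losing the distinction between brackets and relational signs.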

> If so, perhaps a 'starter fix' would be if I worked with the libiconv
> folks to see if that can be added (I opened a similar question in the
> libiconv channel since honestly I'm not sure the best way to fix
> this).

You can pursue both lines of attack independently, especially if the
iconv developers have a good reason for not performing this fallback
already. I'm not sure groff has a good reason for not performing this
fallback. At this point I think I will tap Dave Kemper, another groff
developer who has a fairly strong interest in the fallback issue.

> In parallel, I think I need to understand how I could change the way I
> build man so that it operates in UTF-8 mode.

I think that is a good idea. It looks like your man is man-db, which is
really good news because that's developed by Colin Watson who has also
been groff's package maintainer for Debian for a long time.

Probably the first thing to do is make sure we know what groff is
producing in your environment. Here is how to (mostly) bypass man(1)
and render the groff(1) man page much as man(1) itself would do.

$ zcat $(man -w groff) | groff -man -Tutf8 | less -R

(If less(1) is not available, try "more", "more -b", or this:

$ zcat $(man -w groff) | groff -man -Tutf8 -P -c | ul | more

FYI: The version of "more" on my Debian system breaks lines at incorrect
places when given the above.)

Here, we are using man(1) only as a librarian, to tell us where the
groff(1) man page is. We are directing formatting ourselves.

If this looks fine and you get the angle brackets you're expecting, then
something is running in the pipeline man-db man(1) constructs, _after_
grotty(1) produces the output, and doing violence to the angle brackets;
that would be where the bug lies.

To cut out yet another source of trouble, if your terminal emulator has
more than 765 lines of scrollback buffer, you can omit paging the
groff(1) document entirely.

But if it _doesn't_ look fine, then we need to find out why.
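
If eyeballing the pager output is awkward, a byte-level check is a rough
alternative. The function below is a sketch with a made-up name; it
assumes the brackets of interest are U+27E8 and U+27E9 encoded as UTF-8,
and it is exercised here on canned input rather than a real man page.

```shell
# Succeeds if standard input contains U+27E8 or U+27E9 encoded as UTF-8.
# The function name is hypothetical, chosen for this sketch.
has_math_angle_brackets() {
    la=$(printf '\342\237\250')
    ra=$(printf '\342\237\251')
    LC_ALL=C grep -q -e "$la" -e "$ra"
}

# Intended use (needs man and groff installed, so it is not run here):
#   zcat $(man -w groff) | groff -man -Tutf8 | has_math_angle_brackets \
#       && echo "brackets present"

# Self-check with canned input.
if printf 'GNU \342\237\250http://www.gnu.org\342\237\251.\n' |
    has_math_angle_brackets
then
    echo "brackets present"
else
    echo "brackets missing"
fi
```

Running the same check on the output of "man groff" would show whether
the brackets are lost somewhere in the pipeline man(1) builds.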

I would next inspect groff's device-independent output (which I call
"grout" for short) to see what's being handed to groff's terminal output
driver (grotty(1)). The -Z option stops groff from running the output
driver, so the grout itself is what you see.

$ zcat $(man -w groff) | groff -man -Tutf8 -Z | less

Around line 459 you should see a sequence of lines like this.

tGNU
wh24
Cla
h24
thttp://www.gnu.org
Cra
h24
t.

Those "Cla" and "Cra" lines are key. If they are absent, then you have
almost certainly found a bug in groff.

Another thing I would do is to view the groff_char(7) man page.

$ man groff_char

On my system, code point coverage is complete except for three
characters.

troff: <standard input>:1051: warning: can't find special character 'bs'
troff: <standard input>:1192: warning: can't find special character 'radicalex'
troff: <standard input>:1195: warning: can't find special character 'sqrtex'

These problems are expected everywhere[1] for historical and technical
reasons I won't get into unless asked.

Let me know what you find and we'll see if we can narrow this down.

Regards,
Branden

[1] the first everywhere, the last two on all terminal devices