Re: Translating manpages into several idioms (gettextization)

Colin Watson Wed, 26 Mar 2025 18:01:21 -0700

On Wed, Mar 26, 2025 at 03:24:39PM -0500, G. Branden Robinson wrote:

At 2025-03-26T12:46:18+0000, Colin Watson wrote:

I'd welcome something more robust based on groff, as long as people
remember to consider both sides of the problem (extraction of msgids,
and reassembly of pages using msgstrs).


Yes, I expect that implementing member functions in "src/roff/troff/
node.cpp" to produce output upon running "groff -A pod" would be just
the first of two implementation phases.  The second would be ensuring
that the output is arranged well for interpretation by po4a.

I'm attaching (knock wood) "groff -a" output of the ncurses beep(3) man
page (because it's short but has enough content to illustrate practical
properties of interest) and a hand-made mock-up of envisioned "groff -A
pod" output.

I still very much don't understand how po4a-translate would work with
this sort of approach. My understanding is that the only way that you
could take a preprocessed version of the document, feed it into po4a,
and expect to get useful results out of the po4a-translate stage would
be if you could round-trip from your preprocessed form back to something
closely resembling the original document - and round-tripping entire
pages through POD (rather than just the translatable bits) seems like an
unnecessarily hard problem to solve, and probably not viable for a large
corpus.

Otherwise, po4a needs to be able to work its way through the document,
translate each translatable chunk, but crucially also interpolate
anything untranslatable between those chunks verbatim. It's not going
to be able to do that if it can only see something at one remove from
the original document, or if it can get a preprocessed version but can't
match it up accurately to the corresponding regions of the original
document.

Could we instead have some kind of notation that interleaves text that
should be left untranslated with text that should be translated, the
latter of which could be processed in some way similar to your notation?
For example, if it were a stream of JSON, then you might have something
like this:


  [
    {"type": "raw", "data": ".TH curs_beep 3X 2025-02-01 \""},
    {"type": "msgid", "data": "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@"},
    {"type": "raw", "data": "\" \""},
    {"type": "msgid", "data": "Library calls"},
    {"type": "raw", "data": "\n.SH "},
    {"type": "msgid", "data": "NAME"},
    {"type": "raw", "data": "\n"},
    {"type": "msgid", "data": "B<beep>, B<flash> - ring the (visual) bell of the terminal with I<curses>"},
    ...
  ]

(Yes, this is probably terrible in many ways. Don't get hung up on the
details - this is just to illustrate the concept.)

I presume there'd then be a po4a module to interface with this, let's
say Locale::Po4a::Groff; on the way in it would call groff to emit some
format like this, and on the way out it would translate all the items
that are marked translatable and then just concatenate the results.
The result would be valid groff input.
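
To make the "way out" half concrete, here is a rough sketch in Python
(purely illustrative; a real module would be Perl inside po4a). It
assumes the made-up JSON stream above, and `reassemble` and the
`translations` dictionary are hypothetical names standing in for a PO
catalogue lookup:

```python
import json

def reassemble(stream_json, translations):
    """Rebuild groff input: raw items verbatim, msgid items translated."""
    out = []
    for item in json.loads(stream_json):
        if item["type"] == "msgid":
            # Fall back to the original text when no msgstr is available.
            out.append(translations.get(item["data"], item["data"]))
        else:
            out.append(item["data"])
    return "".join(out)

# Tiny hypothetical stream in the shape sketched above.
stream = json.dumps([
    {"type": "raw", "data": ".SH "},
    {"type": "msgid", "data": "NAME"},
    {"type": "raw", "data": "\n"},
])
print(reassemble(stream, {"NAME": "NOM"}), end="")
```

The point is only that the untranslated pieces pass through untouched,
so nothing has to be reconstructed from a lossy intermediate form.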

In practice my experience has been that one sometimes wants to make
slight tweaks to what po4a thinks is translatable (see e.g.
https://gitlab.com/man-db/man-db/-/blob/main/man/po4a/Locale/Po4a/Manext.pm),
and you probably don't want to be taking all those decisions in groff
anyway. So rather than "type": "msgid", perhaps the format should
provide a bit more contextual information about what you're "inside"
(e.g. knowing that you're looking at a string argument to some
particular macro can be useful, as can knowing that you're inside a
table). I'm not sure exactly what this should look like - I expect that
it would be necessary to build the po4a side at the same time before
committing to an interface, as it'd be quite easy to end up with
something that isn't actually usable.
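
One possible shape for that, again only as a Python sketch: each text
item carries a context, and the consumer supplies the policy that
decides what is translatable. The "text" type, the "context" field and
the `msgids` helper are all invented here for illustration; nothing
like them exists in groff or po4a today.

```python
# Hypothetical items: context is (macro name, argument position).
ITEMS = [
    {"type": "raw", "data": ".TH "},
    {"type": "text", "context": ("TH", 1), "data": "curs_beep"},
    {"type": "raw", "data": ' 3X 2025-02-01 "'},
    {"type": "text", "context": ("TH", 4),
     "data": "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@"},
    {"type": "raw", "data": '" "'},
    {"type": "text", "context": ("TH", 5), "data": "Library calls"},
    {"type": "raw", "data": '"\n'},
]

def msgids(items, policy):
    """Return the text items that the caller's policy wants translated."""
    return [i["data"] for i in items
            if i["type"] == "text" and policy(i["context"])]

# Example policy chosen on the po4a side: leave the page name alone.
def skip_page_name(context):
    return context != ("TH", 1)

print(msgids(ITEMS, skip_page_name))
```

With this split, groff only reports where each piece of text came from,
and tweaks of the Manext.pm kind stay in the po4a module.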

Is that helpful? I realize that preserving fragments of the original
markup may not actually be possible with your current implementation
vision, but that's exactly why I wanted to outline the sorts of things
that I think are likely to be needed sooner rather than later.

Alternatively, if the output could include accurate offsets for each
translatable chunk, that would probably also work: Locale::Po4a::Groff
could run your new code to get a preprocessed version and then match
everything up. I still think we'd need a richer format than just a
stream of lines of text though; there's the context issue I mentioned
above, but also translatable chunks don't always match up with lines
very well. For example, I'd say that this line of input:


  .TH curs_beep 3X 2025-02-01 "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" "Library calls"

... should produce two msgids, "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@"
and "Library calls".
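
As an illustration of the offsets idea, here is a small Python sketch
that reports (start, end, text) for each quoted argument of that line.
It assumes, simplistically, that only the quoted arguments of .TH are
translatable and that no argument contains roff's doubled-quote escape;
`th_msgids` is a hypothetical helper, not anything groff provides.

```python
import re

# Quoted macro arguments; "" inside an argument is not handled here.
QUOTED_ARG = re.compile(r'"([^"]*)"')

def th_msgids(line):
    """Return (start, end, text) for each quoted argument of a .TH line."""
    if not line.startswith(".TH "):
        return []
    return [(m.start(1), m.end(1), m.group(1))
            for m in QUOTED_ARG.finditer(line)]

line = ('.TH curs_beep 3X 2025-02-01 '
        '"ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" "Library calls"')
for start, end, text in th_msgids(line):
    # The offsets index straight back into the original source line.
    assert line[start:end] == text
    print(start, end, text)
```

Given offsets like these, the consumer can splice msgstrs into the
original line and leave everything outside the ranges untouched.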

5.  At this point in the formatting process, the formatter's notion of a
   font is an integer referring to a mounting position.  We don't know
   what the font "is".  The current font is also a property of the
   environment, not of nodes per se.  But: (a) we know when the font
   selection _changes_, and (b) for man page formatting I'll bet we can
   assume that fonts are mounted in traditional order: 1, 2, 3, 4 -> R,
   I, B, BI.[1]


po4a does exactly that today, FWIW.

  https://github.com/mquinson/po4a/blob/v0.73/lib/Locale/Po4a/Man.pm#L1800

6.  Text in a man page that uses special characters (trout/grout: the
   "C" command) probably doesn't need to be translated.

   One exception: as usual we'd likely special-case what "groff -a"
   renders as `<->` and `<hy>` as good old `-`, and punt (warn on and
   ignore) any other special character.

This seems a bit too simplistic. Looking at grout for man(1), for
instance, I see a bunch of "Caq" commands that correspond to where the
page source has "'". And I wouldn't be surprised to find other C
commands in the grout for mostly-English prose; what if somebody
described an approach as "naïve", for instance?
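
A less lossy option might be a lookup from special-character names to
text, warning only on names that are missing from the table. The Python
sketch below is illustrative: the table is deliberately tiny, and
`glyph_to_text` is a hypothetical helper rather than part of groff or
po4a.

```python
# A few groff special-character names and plain-text equivalents.
# Deliberately incomplete; a real table would follow groff_char(7).
GLYPH_TEXT = {
    "aq": "'",       # neutral apostrophe, the "Caq" case above
    "hy": "-",       # hyphen, folded to "-" as proposed
    "\\-": "-",      # minus sign, likewise folded to "-"
    ":i": "\u00ef",  # i with dieresis, as in "naïve"
}

def glyph_to_text(name):
    """Return text for a glyph name, or None so the caller can warn."""
    return GLYPH_TEXT.get(name)

print("na" + glyph_to_text(":i") + "ve")
```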


Thanks,

--
Colin Watson (he/him)                              [cjwat...@debian.org]
