On Wed, Mar 26, 2025 at 03:24:39PM -0500, G. Branden Robinson wrote:
> At 2025-03-26T12:46:18+0000, Colin Watson wrote:
> > I'd welcome something more robust based on groff, as long as people
> > remember to consider both sides of the problem (extraction of msgids,
> > and reassembly of pages using msgstrs).
>
> Yes, I expect that implementing member functions in
> "src/roff/troff/node.cpp" to produce output upon running "groff -A pod"
> would be just the first of two implementation phases.  The second would
> be ensuring that the output is arranged well for interpretation by po4a.
>
> I'm attaching (knock wood) "groff -a" output of the ncurses beep(3) man
> page (because it's short but has enough content to illustrate practical
> properties of interest) and a hand-made mock-up of envisioned "groff -A
> pod" output.

I still very much don't understand how po4a-translate would work with this sort of approach. My understanding is that the only way that you could take a preprocessed version of the document, feed it into po4a, and expect to get useful results out of the po4a-translate stage would be if you could round-trip from your preprocessed form back to something closely resembling the original document - and round-tripping entire pages through POD (rather than just the translatable bits) seems like an unnecessarily hard problem to solve, and probably not viable for a large corpus.

Otherwise, po4a needs to be able to work its way through the document, translate each translatable chunk, but crucially also interpolate anything untranslatable between those chunks verbatim. It's not going to be able to do that if it can only see something at one remove from the original document, or if it can get a preprocessed version but can't match it up accurately to the corresponding regions of the original document.

Could we instead have some kind of notation that interleaves text that should be left untranslated with text that should be translated, the latter of which could be processed in some way similar to your notation? For example, if it were a stream of JSON, then you might have something like this:

  [
    {"type": "raw", "data": ".TH curs_beep 3X 2025-02-01 \""},
    {"type": "msgid", "data": "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@"},
    {"type": "raw", "data": "\" \""},
    {"type": "msgid", "data": "Library calls"},
    {"type": "raw", "data": "\n.SH "},
    {"type": "msgid", "data": "NAME"},
    {"type": "raw", "data": "\n"},
    {"type": "msgid", "data": "B<beep>, B<flash> - ring the (visual) bell of the terminal with I<curses>"},
    ...
  ]

(Yes, this is probably terrible in many ways. Don't get hung up on the details - this is just to illustrate the concept.)

I presume there'd then be a po4a module to interface with this, let's say Locale::Po4a::Groff; on the way in it would call groff to emit some format like this, and on the way out it would translate all the items that are marked translatable and then just concatenate the results. The result would be valid groff input.
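
To make the "on the way out" step concrete, here is a minimal sketch. It is written in Python purely for brevity (a real po4a module would be Perl), and the names reassemble and catalog are hypothetical, as is the "raw"/"msgid" stream format itself:

```python
import json

def reassemble(items, translate):
    """Concatenate the stream in order: raw chunks verbatim, msgids translated."""
    out = []
    for item in items:
        if item["type"] == "msgid":
            out.append(translate(item["data"]))
        else:
            out.append(item["data"])
    return "".join(out)

# A tiny stream in the illustrative format above (not real groff output).
stream = json.loads(r'''
[
  {"type": "raw", "data": ".SH "},
  {"type": "msgid", "data": "NAME"},
  {"type": "raw", "data": "\n"}
]
''')

# Stand-in for a PO catalog; untranslated msgids fall back to the original.
catalog = {"NAME": "NOM"}
page = reassemble(stream, lambda msgid: catalog.get(msgid, msgid))
```

With an identity translation the output is byte-for-byte the original input, which is the property po4a-translate needs.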

In practice my experience has been that one sometimes wants to make slight tweaks to what po4a thinks is translatable (see e.g. https://gitlab.com/man-db/man-db/-/blob/main/man/po4a/Locale/Po4a/Manext.pm), and you probably don't want to be taking all those decisions in groff anyway. So rather than "type": "msgid", perhaps the format should provide a bit more contextual information about what you're "inside" (e.g. knowing that you're looking at a string argument to some particular macro can be useful, as can knowing that you're inside a table). I'm not sure exactly what this should look like - I expect that it would be necessary to build the po4a side at the same time before committing to an interface, as it'd be quite easy to end up with something that isn't actually usable.
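
As a rough sketch of what "more contextual information" might buy us: suppose each item carried a "context" object instead of a fixed translatable/untranslatable verdict. The field names (macro, arg, in_table) and the rules below are all made up for illustration; the point is only that the decision would move to the po4a side:

```python
def is_translatable(item):
    """Hypothetical po4a-side policy; groff would only report context."""
    ctx = item.get("context", {})
    # Assumed rule: the first three .TH arguments (title, section, date)
    # are not prose and should be passed through untouched.
    if ctx.get("macro") == "TH" and ctx.get("arg", 0) <= 3:
        return False
    # Assumed rule: whitespace-only data never becomes a msgid.
    return bool(item["data"].strip())

items = [
    {"context": {"macro": "TH", "arg": 1}, "data": "curs_beep"},
    {"context": {"macro": "TH", "arg": 5}, "data": "Library calls"},
    {"context": {"macro": "SH", "arg": 1}, "data": "NAME"},
    {"context": {"in_table": True}, "data": "   "},
]
msgids = [item["data"] for item in items if is_translatable(item)]
```

A project-specific module along the lines of Manext.pm could then override is_translatable without any change on the groff side.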

Is that helpful? I realize that preserving fragments of the original markup may not actually be possible with your current implementation vision, but that's exactly why I wanted to outline the sorts of things that I think are likely to be needed sooner rather than later.

Alternatively, if the output could include accurate offsets for each translatable chunk, that would probably also work: Locale::Po4a::Groff could run your new code to get a preprocessed version and then match everything up. I still think we'd need a richer format than just a stream of lines of text though; there's the context issue I mentioned above, but also translatable chunks don't always match up with lines very well. For example, I'd say that this line of input:

  .TH curs_beep 3X 2025-02-01 "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" "Library calls"

... should produce two msgids, "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" and "Library calls".
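
A sketch of how the offsets variant could be consumed, again in Python for brevity. In the real thing the offsets would come from groff; here a regular expression stands in for that, and it is only an assumption good enough for this one line (it ignores roff's rules for embedded quotes, for instance). The function names are hypothetical:

```python
import re

line = ('.TH curs_beep 3X 2025-02-01 '
        '"ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" "Library calls"')

def quoted_args(text):
    """Return (start, end, string) for the inside of each double-quoted argument."""
    return [(m.start(1), m.end(1), m.group(1))
            for m in re.finditer(r'"([^"]*)"', text)]

def substitute(text, chunks, translate):
    """Replace each chunk by its translation, copying everything else verbatim."""
    out, pos = [], 0
    for start, end, msgid in chunks:
        out.append(text[pos:start])   # untranslatable material, unchanged
        out.append(translate(msgid))  # translated chunk
        pos = end
    out.append(text[pos:])
    return "".join(out)

chunks = quoted_args(line)
msgids = [msgid for _, _, msgid in chunks]

# Stand-in catalog; msgids without an entry are left as they are.
catalog = {"Library calls": "Appels de bibliothèque"}
translated = substitute(line, chunks, lambda m: catalog.get(m, m))
```

The quotes and the first three arguments never pass through the translator, so they cannot be damaged by it.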

> 5.  At this point in the formatting process, the formatter's notion of a
>    font is an integer referring to a mounting position.  We don't know
>    what the font "is".  The current font is also a property of the
>    environment, not of nodes per se.  But: (a) we know when the font
>    selection _changes_, and (b) for man page formatting I'll bet we can
>    assume that fonts are mounted in traditional order: 1, 2, 3, 4 -> R,
>    I, B, BI.[1]

po4a does exactly that today, FWIW.

  https://github.com/mquinson/po4a/blob/v0.73/lib/Locale/Po4a/Man.pm#L1800
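
For illustration only, the assumption in item 5 amounts to something like the following. This is not po4a's code, just a sketch of the mapping; the input representation (a list of font-position/text runs) and the name to_pod are hypothetical, and no escaping of ">" inside the text is attempted:

```python
# Traditional mounting order assumed in item 5: 1, 2, 3, 4 -> R, I, B, BI.
FONT_NAMES = {1: "R", 2: "I", 3: "B", 4: "BI"}

def to_pod(runs):
    """Turn (font_position, text) runs into POD-style inline markup."""
    out = []
    for position, text in runs:
        style = FONT_NAMES.get(position, "R")  # unknown positions: treat as roman
        if style == "B":
            out.append(f"B<{text}>")
        elif style == "I":
            out.append(f"I<{text}>")
        elif style == "BI":
            out.append(f"B<I<{text}>>")
        else:
            out.append(text)
    return "".join(out)

runs = [
    (3, "beep"), (1, ", "), (3, "flash"),
    (1, " - ring the (visual) bell of the terminal with "),
    (2, "curses"),
]
msgid = to_pod(runs)
```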

> 6.  Text in a man page that uses special characters (trout/grout: the
>    "C" command) probably doesn't need to be translated.
>
>    One exception: as usual we'd likely special-case what "groff -a"
>    renders as `<->` and `<hy>` as good old `-`, and punt (warn on and
>    ignore) any other special character.

This seems a bit too simplistic. Looking at grout for man(1), for instance, I see a bunch of "Caq" commands that correspond to where the page source has "'". And I wouldn't be surprised to find other C commands in the grout for mostly-English prose; what if somebody described an approach as "naïve", for instance?
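
To show the sort of handling I have in mind, here is a deliberately simplified sketch that maps a few special character names back to text instead of ignoring them. The command sequences are made up for illustration rather than captured from real grout (which would also contain motion commands between them), the table is far from complete, and the function name is hypothetical:

```python
# Special character names -> plain text; a tiny, incomplete sample.
SPECIAL = {"aq": "'", "hy": "-", "\\-": "-", ":i": "ï"}

def text_from_grout(commands):
    """Rebuild a word from simplified 't' (text) and 'C' (special character)
    commands, collecting unknown special characters instead of dropping them
    silently."""
    out, unknown = [], []
    for cmd in commands:
        if cmd.startswith("t"):
            out.append(cmd[1:])
        elif cmd.startswith("C"):
            name = cmd[1:]
            if name in SPECIAL:
                out.append(SPECIAL[name])
            else:
                unknown.append(name)  # candidates for a warning
    return "".join(out), unknown

word, unknown = text_from_grout(["tna", "C:i", "tve"])
contraction, unknown2 = text_from_grout(["tdon", "Caq", "tt", "Cbu"])
```

Under the "warn on and ignore" rule from item 6, both of these words would lose a character in the middle of the msgid.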

Thanks,

--
Colin Watson (he/him)                              [cjwat...@debian.org]
