On Wed, Mar 26, 2025 at 03:24:39PM -0500, G. Branden Robinson wrote:
> At 2025-03-26T12:46:18+0000, Colin Watson wrote:
> > I'd welcome something more robust based on groff, as long as people
> > remember to consider both sides of the problem (extraction of msgids,
> > and reassembly of pages using msgstrs).
>
> Yes, I expect that implementing member functions in "src/roff/troff/
> node.cpp" to produce output upon running "groff -A pod" would be just
> the first of two implementation phases. The second would be ensuring
> that the output is arranged well for interpretation by po4a.
>
> I'm attaching (knock wood) "groff -a" output of the ncurses beep(3) man
> page (because it's short but has enough content to illustrate practical
> properties of interest) and a hand-made mock-up of envisioned "groff -A
> pod" output.
I still very much don't understand how po4a-translate would work with
this sort of approach. My understanding is that the only way that you
could take a preprocessed version of the document, feed it into po4a,
and expect to get useful results out of the po4a-translate stage would
be if you could round-trip from your preprocessed form back to something
closely resembling the original document - and round-tripping entire
pages through POD (rather than just the translatable bits) seems like an
unnecessarily hard problem to solve, and probably not viable for a large
corpus.
Otherwise, po4a needs to be able to work its way through the document,
translate each translatable chunk, but crucially also interpolate
anything untranslatable between those chunks verbatim. It's not going
to be able to do that if it can only see something at one remove from
the original document, or if it can get a preprocessed version but can't
match it up accurately to the corresponding regions of the original
document.
Could we instead have some kind of notation that interleaves text that
should be left untranslated with text that should be translated, the
latter of which could be processed in some way similar to your notation?
For example, if it were a stream of JSON, then you might have something
like this:
[
{"type": "raw", "data": ".TH curs_beep 3X 2025-02-01 \""},
{"type": "msgid", "data": "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@"},
{"type": "raw", "data": "\" \""},
{"type": "msgid", "data": "Library calls"},
{"type": "raw", "data": "\"\n.SH "},
{"type": "msgid", "data": "NAME"},
{"type": "raw", "data": "\n"},
{"type": "msgid", "data": "B<beep>, B<flash> - ring the (visual) bell of the terminal with I<curses>"},
...
]
(Yes, this is probably terrible in many ways. Don't get hung up on the
details - this is just to illustrate the concept.)
I presume there'd then be a po4a module to interface with this, let's
say Locale::Po4a::Groff; on the way in it would call groff to emit some
format like this, and on the way out it would translate all the items
that are marked translatable and then just concatenate the results.
The result would be valid groff input.
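To make the "way out" concrete, here is a minimal sketch of that
concatenation step. It is written in Python purely for illustration (a
real po4a module would be Perl), the function name `reassemble` and the
catalog entries are made up, and the stream is a hand-written fragment
in the JSON shape above rather than anything groff emits today.

```python
def reassemble(stream, catalog):
    """Join raw items verbatim and msgid items via the catalog, in order.

    `catalog` maps msgid -> msgstr; an untranslated msgid falls back to
    the original text, much as gettext does.
    """
    out = []
    for item in stream:
        if item["type"] == "raw":
            out.append(item["data"])
        elif item["type"] == "msgid":
            out.append(catalog.get(item["data"], item["data"]))
        else:
            raise ValueError("unknown item type: %r" % item["type"])
    return "".join(out)


# Hand-written fragment in the hypothetical stream format.
stream = [
    {"type": "raw", "data": '.TH curs_beep 3X 2025-02-01 "'},
    {"type": "msgid", "data": "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@"},
    {"type": "raw", "data": '" "'},
    {"type": "msgid", "data": "Library calls"},
    {"type": "raw", "data": '"\n.SH '},
    {"type": "msgid", "data": "NAME"},
    {"type": "raw", "data": "\n"},
]

# Made-up translations, only to show the substitution.
catalog = {"Library calls": "Appels de bibliothèque", "NAME": "NOM"}

print(reassemble(stream, catalog), end="")
```

With an empty catalog this reproduces the original input byte for byte,
which is the round-trip property I care about.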
In practice my experience has been that one sometimes wants to make
slight tweaks to what po4a thinks is translatable (see e.g.
https://gitlab.com/man-db/man-db/-/blob/main/man/po4a/Locale/Po4a/Manext.pm),
and you probably don't want to be taking all those decisions in groff
anyway. So rather than "type": "msgid", perhaps the format should
provide a bit more contextual information about what you're "inside"
(e.g. knowing that you're looking at a string argument to some
particular macro can be useful, as can knowing that you're inside a
table). I'm not sure exactly what this should look like - I expect that
it would be necessary to build the po4a side at the same time before
committing to an interface, as it'd be quite easy to end up with
something that isn't actually usable.
Is that helpful? I realize that preserving fragments of the original
markup may not actually be possible with your current implementation
vision, but that's exactly why I wanted to outline the sorts of things
that I think are likely to be needed sooner rather than later.
Alternatively, if the output could include accurate offsets for each
translatable chunk, that would probably also work: Locale::Po4a::Groff
could run your new code to get a preprocessed version and then match
everything up. I still think we'd need a richer format than just a
stream of lines of text though; there's the context issue I mentioned
above, but also translatable chunks don't always match up with lines
very well. For example, I'd say that this line of input:
.TH curs_beep 3X 2025-02-01 "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" "Library calls"
... should produce two msgids, "ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@"
and "Library calls".
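As a rough sketch of the offsets variant: suppose the formatter could
report (start, end) character offsets for each translatable chunk. The
regex below is only a stand-in for that report (real roff argument
quoting is more involved than this), and `th_chunks` and `splice` are
hypothetical names, shown in Python for illustration.

```python
import re

LINE = ('.TH curs_beep 3X 2025-02-01 '
        '"ncurses @NCURSES_MAJOR@.@NCURSES_MINOR@" "Library calls"')


def th_chunks(line):
    """Return (start, end, text) for each double-quoted argument.

    Stand-in for offsets that groff itself would have to supply; it
    ignores doubled quotes and escapes inside arguments.
    """
    return [(m.start(1), m.end(1), m.group(1))
            for m in re.finditer(r'"([^"]*)"', line)]


def splice(source, chunks, catalog):
    """Copy untranslated regions verbatim; substitute each chunk."""
    out = []
    pos = 0
    for start, end, text in chunks:
        out.append(source[pos:start])        # untranslatable region
        out.append(catalog.get(text, text))  # translated chunk
        pos = end
    out.append(source[pos:])                 # trailing remainder
    return "".join(out)


chunks = th_chunks(LINE)
for start, end, text in chunks:
    print(start, end, repr(text))
```

Here one line of input yields two chunks, and everything outside the
reported offsets passes through untouched.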
> 5. At this point in the formatting process, the formatter's notion of a
> font is an integer referring to a mounting position. We don't know
> what the font "is". The current font is also a property of the
> environment, not of nodes per se. But: (a) we know when the font
> selection _changes_, and (b) for man page formatting I'll bet we can
> assume that fonts are mounted in traditional order: 1, 2, 3, 4 -> R,
> I, B, BI.[1]
po4a does exactly that today, FWIW.
https://github.com/mquinson/po4a/blob/v0.73/lib/Locale/Po4a/Man.pm#L1800
> 6. Text in a man page that uses special characters (trout/grout: the
> "C" command) probably doesn't need to be translated.
>
> One exception: as usual we'd likely special-case what "groff -a"
> renders as `<->` and `<hy>` as good old `-`, and punt (warn on and
> ignore) any other special character.
This seems a bit too simplistic. Looking at grout for man(1), for
instance, I see a bunch of "Caq" commands that correspond to where the
page source has "'". And I wouldn't be surprised to find other C
commands in the grout for mostly-English prose; what if somebody
described an approach as "naïve", for instance?
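To illustrate what I mean, here is a small sketch that folds "C"
commands back into the surrounding text instead of dropping them. The
input lines are hand-written in the style of grout "t" and "C" commands,
not captured from groff; the glyph table is a tiny assumed subset; and
positioning commands are ignored entirely. Python is used only for
illustration.

```python
# Assumed subset of special character names and their text equivalents.
GLYPHS = {
    "aq": "'",    # neutral apostrophe
    "hy": "-",    # hyphen
    "\\-": "-",   # minus sign
    ":i": "ï",    # i with dieresis, as in "naïve"
}


def text_from_grout(lines):
    """Rebuild running text from 't' (text) and 'C' (special char) lines.

    Unknown special characters are kept as a \\[name] escape rather than
    silently discarded, so a translator still sees that something was
    there.
    """
    out = []
    for line in lines:
        if line.startswith("t"):
            out.append(line[1:])
        elif line.startswith("C"):
            name = line[1:].strip()
            out.append(GLYPHS.get(name, "\\[" + name + "]"))
    return "".join(out)


print(text_from_grout(["tna", "C:i", "tve"]))
print(text_from_grout(["tdon", "Caq", "tt"]))
```

The point is only that a word containing a special character should
still come out as one translatable word.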
Thanks,
--
Colin Watson (he/him) [cjwat...@debian.org]