On Thu, Mar 27, 2025 at 09:27:48PM -0500, G. Branden Robinson wrote:
> At 2025-03-27T01:00:17+0000, Colin Watson wrote:
> > I still very much don't understand how po4a-translate would work with
> > this sort of approach. My understanding is that the only way that you
> > could take a preprocessed version of the document, feed it into po4a,
> > and expect to get useful results out of the po4a-translate stage would
> > be if you could round-trip from your preprocessed form back to
> > something closely resembling the original document - and
> > round-tripping entire pages through POD (rather than just the
> > translatable bits) seems like an unnecessarily hard problem to solve,
> > and probably not viable for a large corpus.
> [snip]
>
> Thanks a lot for your thoughtful response. I can't rebut most of your
> points, in large part because I have, it now seems to me, a deficient
> grasp of how po4a is used in the field.
>
> I may have gotten carried away by Martin Quinson's enthusiastic response
> to my pitch, thinking I was facing down a tangle of hemp while equipped
> with a strong sword arm, a sharp blade, and a hungry eye for Asia.

To be clear, I agree this is a problem well worth solving, so I
certainly don't want to be applying stop-energy to it. Especially if it
could manage to make mdoc pages usefully translatable ...

> > Is that helpful? I realize that preserving fragments of the original
> > markup may not actually be possible with your current implementation
> > vision,
>
> Yes, that's intractably hard or even computationally impossible (because
> irreversible macro interpolations, et al., have already taken
> place)--under the strictly confined alternative-node-output scheme I had
> in mind.
>
> It could be that the problem is still solvable with a technique similar
> to that used for grohtml, combined with how I envision refactoring the
> troff/grohtml relationship, and that is by pushing more "tagging" work
> into the macro packages themselves. In this case, man(7) and mdoc(7),
> of course.

Another suggestion: I realize that groff's idea of the current line
number etc. is not always 100% accurate at the moment. Might it be
tractable to remember enough information while processing macros to fix
that? If groff could emit accurate positional information along with
each chunk of text it emits in this sort of mode, then a po4a module
would be able to put things back together. (I suppose groff would also
need to report the position of the _end_ of the chunk of the input
stream corresponding to each chunk of emitted text.)
Or is that what you're referring to by macro tagging?
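
To make the idea concrete, here is a rough sketch (in Python, purely for
illustration) of how a po4a-like consumer could put a page back together
if groff reported start and end offsets for every chunk of text. The
Chunk record, the splice() function, and the offset-based interface are
all my own assumptions; nothing like this exists in groff or po4a today.

```python
from typing import NamedTuple


class Chunk(NamedTuple):
    """Hypothetical record: one chunk of emitted text plus its source span."""
    start: int  # offset of the chunk's first character in the input
    end: int    # offset one past the chunk's last character
    text: str   # the translatable text reported for that span


def splice(source: str, chunks: list[Chunk], translations: dict[str, str]) -> str:
    """Rebuild the document, replacing each reported span with its translation.

    Everything outside the reported spans (requests, macro calls, comments)
    is copied through untouched. Untranslated chunks keep their original
    source text.
    """
    out = []
    pos = 0
    for chunk in sorted(chunks, key=lambda c: c.start):
        if chunk.start < pos:
            raise ValueError("overlapping chunks")
        out.append(source[pos:chunk.start])
        original = source[chunk.start:chunk.end]
        out.append(translations.get(chunk.text, original))
        pos = chunk.end
    out.append(source[pos:])
    return "".join(out)


# Usage: offsets are computed here with str.index only to keep the example
# honest; in the scheme described above they would come from groff.
source = ".SH NAME\nfoo \\- do things\n"
start = source.index("NAME")
chunks = [Chunk(start, start + len("NAME"), "NAME")]
print(splice(source, chunks, {"NAME": "NOM"}), end="")
```

This only works if the end position is reported as well as the start,
which is why I mentioned it: without it the consumer can't tell how much
of the input a chunk replaces.
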
Another possibility would be to make groff actually responsible for
injecting translated strings, in sort of the way that Martin was
wondering about in
https://github.com/mquinson/po4a/issues/527#issuecomment-2366953012, by
providing it a .po file or something similar. The main difficulty I can
imagine here is that either it would need to know exactly which strings
po4a had produced as msgstrs, or po4a would need to refrain from making
any additional tweaks to msgstrs; you'd also need a way to reverse
transformations such as font changes to B<...> and the like. And I
suppose I expected this to be too much scope creep for groff.
It's also possible I'm missing something about po4a! Martin said that
some formats delegate all the parsing to an external tool, but in
digging through po4a I wasn't able to find any examples of this. Having
examples of the approaches available would be quite helpful.

> > And I wouldn't be surprised to find other C commands in the grout for
> > mostly-English prose; what if somebody described an approach as
> > "naïve", for instance?
>
> _That_, I think, on the other hand, will be relatively rare. Most
> people writing man(7) (or mdoc(7), for that matter), seem to be stumped
> by how to input such words, so they do something non-portable or just
> give up the attempt, and degrade their input to Basic Latin ("ASCII").

Maybe a more realistic example would be author names, such as one of
those found in po4a(1p). Sometimes those are just in a bare list of
names and email addresses, as in that case, and so don't really need
translation (although I'm not sure how groff would be in a position to
determine that reliably); but sometimes they appear in short sentences
crediting a contributor for something specific.
In this case, po4a(1p) went for just entering the name in question in
UTF-8 and not worrying about portability to AT&T troff, and to be honest
I expect that to be the default among (ahem) naïve authors. These days
many people just assume that UTF-8 input will work, and since it mostly
does work with modern groff, you have to know a certain amount about
troff history to even know that there's a problem here that you might
be stumped by how to solve.
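
For illustration, this is roughly the kind of rewrite that groff's
preconv(1) applies to such input before troff proper sees it: each
non-ASCII character becomes a \[uXXXX] escape. The sketch below is
Python written for this mail, not preconv's actual code, and the
function name is made up. Note that the \[uXXXX] form is itself a groff
extension, so this doesn't buy portability to AT&T troff either.

```python
def to_groff_escapes(text: str) -> str:
    """Replace each non-ASCII character with a groff \\[uXXXX] escape.

    ASCII characters are passed through unchanged. Code points are written
    as uppercase hexadecimal, zero-padded to at least four digits.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        else:
            out.append("\\[u%04X]" % cp)
    return "".join(out)


print(to_groff_escapes("a naïve approach"))  # prints: a na\[u00EF]ve approach
```
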
--
Colin Watson (he/him) [cjwat...@debian.org]