[bug #65585] [gropdf] problems introduced by commit cd9fde325f

G. Branden Robinson Fri, 12 Apr 2024 10:03:59 -0700

Update of bug #65585 (group groff):

                 Summary: Problems introduced by commit cd9fde325f => [gropdf]
problems introduced by commit cd9fde325f


    _______________________________________________________

Follow-up Comment #1:

Hi Deri,

Thanks for giving my changes a close look.  I'm obviously pretty inexperienced
with this part of the _groff_ code base.

[comment #0 original submission:]
> I have been looking at the current state of pdf.tmac and have found a few
issues. The changes which have been committed are:-
> 
> Introduction of a new flag, -S, which replaces my change which introduced
the concept of passing a single pipe character as a hotspot meant that the
actual text to form the actual hotspot would be "piped" to the document
stream, terminated by doing a markstop. This is what allowed Branden to
implement the man macros which place contents into a diversion before emitting
a hotspot. I have no objection to this change, although there were several
ways of setting up a hotspot containing a single pipe character before this
change, and most users are probably familiar with piping data. (Perhaps I
should have used "<" as the single character). I would be very annoyed if
someone sent me a pdf with a single "|" as a hotspot - would take me about 5
minutes to click it with the mouse!

I don't object to the _mnemonic_ of "|" here; what made me uneasy was the use
of in-band signaling to the macro file when faced with arbitrary user input. 
Someone somewhere would want to make a lone "|" hotspot and would, I predict,
be flabbergasted that they weren't allowed.  Yes, we could advise them to work
around the restriction by sticking a dummy character (or similar) adjacent to
it, but that doesn't feel to me like advice we could be proud of.

Also I don't think "pdf.tmac" is the right layer at which to have _groff_ warn
the user about accessibility flaws.  Maybe there isn't such a layer, apart
from excoriating the document author.

And if someone set a "|" in 36- or 48-point type, presumably it wouldn't be as
much of a problem.  So a preclusion of it (unadorned) as a hotspot seems to me
a bit of a severe measure.
 
> Another change was an attempt to only allow a restart after a preceding
suspend. This is a good idea, but it is in the wrong place, it should be
implemented in gropdf rather than pdf.tmac, since a user may use \X'pdf:
markrestart' as documented in the gropdf man page. Which bypasses the check in
pdf.tmac.

I'll need to understand this better, but I admit that it did feel to me like
only half of a fix.  My instinct is that it would be better to expose
"markstart" and "markend" directly with convenience macros, and update
_gropdf_(1) such that users could understand what they were for and how they
might write their own hyperlinking macros.

I'm pretty uncomfortable with "pdfmark.tmac"'s approach, which assumes that it
has anticipated every user's every need (but also stops documenting its
all-singing, all-dancing comprehensiveness in section 4 of its manual), and
all you need to do to use it is write a macro call resembling an _ffmpeg_
command to get results.  

There's even a place for that--but not, I think, via the means of going
straight to device control commands.  I think we should have a more granular
API for access to PDF features, and I think your approach in "pdf.tmac" has
taken several steps down that path already.

> Branden has implemented a looping construct to hold tags replacing an
associative array. It is not used if the user is using mom macros (it does not
work for mom documents).

I'm aware of it and it's on my to-do list.  I
[https://lists.gnu.org/archive/html/groff/2024-03/msg00022.html spoke of it on
the mailing list in March].

Long story short, I plan to get my new approach working for _mom_(7) but that
means I have to solve the danged old "node embedding in \X commands" problem
that has vexed us for a while.

> In fact it only works reliably for the man macros producing a "book" of man
pages, where it is used to differenciate between an internal link to another
page in the book or a "man:" URL to an external man page. The problem is that
the new code does not implement:-

> .pdfhref L -D name


> Which should emit the descriptive text associated with the named tag at the
time it was defined, but instead it just emits the "name". This is documented
in pdfmark.pdf, and used several times in pdfmark.ms, so it is reasonable to
assume users may be using this in their own documents if they use "pdfmom
--roff  -mspdf". So it is wrong to assume the new code is suitable for
everything except mom.

Okay.  That's a defect/implementation gap I was unaware of.  It might be a
good idea to spawn a ticket off for that and have this one depend on it.

> There is also an issue with the speed of the new code. One test file I used
went from an elapsed time of 0m3.08s to 11m31.62s. (About 670 times slower!).
This was producing a large document (LinuxManBook) from a single file, using
the command:-

> time ~/groff-git/groff/test-pdfmom  --roff -Tpdf -mandoc -petk
LinuxManBook.trf -z


> (test-pdfmom is the same as test-groff but calling the pdfmom in the build
directory). I don't know whether this slow down in groff will be acceptable to
most users,

That's freaking awful and I agree that it's unacceptable for release.  I am
wondering why Alex Colomar hasn't rung my bell about it.  Maybe he hasn't
tried _groff_ Git HEAD lately.  He did complain that it was tedious to work
with unreleased _groff_ versions, and asked for a 1.24 ASAP.  I demurred, as I
think there's still a solid chunk of work to be done--a few months' worth.

> I'm meant to be on a sabatical so I am not going to argue either way, but be
aware that is Branden's attempt to solve one issue which was solved months ago
in my branch waiting for merging. My use of stringhex was described as
"obfuscation", which is rather insulting if the intention was to imply a
deliberate attempt to obscure the purpose of my code.

That certainly was not my intention.  I meant simply that the stringhexed
string did not have obvious meaning when dumping device-independent or
approximated output (with "groff -Z" or "groff -a").  I want these output
formats to be accessible and comprehensible to users so that they feel more
competent to troubleshoot problems (and send us better-informed bug
reports!).

> The correct interpretation of its use is to protect the data from
interpretation by groff so that any byte sequence can be used as a string name
in groff. This is analagous to using base64 to protect binary data in emails.

I understand.  But by the same token, no human reads base64.  They decode it
first.  But we don't have an "unstringhex" command for performing this
transformation on "groff -Z" or "groff -a" output, and I'd suggest that we
shouldn't need one.

Admittedly, any accessibility my approach gains is going to be traded away
again for documents that don't predominantly use the Latin alphabet.  On net I
think my approach makes groff -[aZ] output more accessible for _many_ users
and is no worse for the rest.
  
> Although Branden's loopy code solves some issues with 1.23.0 it fails in a
number of ways which are dealt with successfully by the pdf.tmac in my branch
(i.e. t.trf in #64576).

Okay, I will have to take a look at that when my availability recovers. 
(Since you're not on the _groff_ list for a while, you may not know that I had
a death in the immediate family this week and, in an employment context, I'd
be on bereavement leave right now.)

> There is a bug in Branden's code. The attached file, pdf-L.trf, illustrates
the issue. According to the documentation in pdfmark.pdf, if you use .pdfhref
L with a destination name but no descriptive text, the descriptive text given
when the destination is named is used. With Branden's code, instead of using
the descriptive text it uses the name of the destination instead. The two pdfs
called pdf-L illustrate the problem.

Okay.  Thanks for supplying reproducers.  This should be looked into.

> Another bug results in entries in the array Branden loops over get
over-vritten in certain circumstances. This code illustrates the bug.

> .ig
>   groff -Tpdf -dPDF.EXPORT=1 -z pdf-M.trf
> 
> Results:-
> 
> .ds pdf:bm1.tag one
> .ds pdf:bm1.tag two
> 
> (The bm1.tag has been overwritten!!!)

Hmm, yes, that does sound incorrect.

> But,
> 
>   groff -Tpdf -dPDF.EXPORT=1 -dPRINTSTYLE=1 -z pdf-M.trf
> 
> Results:-
> 
> .ds pdf:look(one) Once upon...
> .ds pdf:look(two) This is two
> 
> ..
> .pdfbookmark -T one 1 Once upon...
> .pdfhref M -D two -E This is two


> Setting "PRINTSTYLE=1" bypasses Branden's changes, because he uses the
original code if mom macros are used. I don't understand why it is desirable
to have two separate methods.

It isn't desirable.  Supporting both was a temporary measure to avoid
regressing _mom_ documents.

> I have fixed all the problems listed above, plus a speed up of Branden's
loopy code. I have also included Keith's clever solution to polluting the pdf
flags with data which can cause the errors about illegal characters in
identifier names if the optional "--" marker not used.

I would like to review that; Keith and I seem frequently to disagree.

> There is a minor regression in this code, not caught by any of the check
tests, but when producing our pdf documents you will notice these new
warnings:-

> gropdf:man/groff.7: warning: Can't convert 'rs' to unicode
> gropdf:../contrib/mom/examples/sample_docs.mom: warning: Can't convert 'em'
to unicode


I have not seen any of these warnings in any builds I have done at any
point...oh, you mean these pop up with your new code?

> They are caused because gropdf currently has no way of knowing what the
UTF-16 for these groff characters. Previously the "em" character was handled
by a:-

> .tr \[em]-


> In pdf.tmac. I was never happy with this "fudge" and the correct handling
was part of my branch. I have changed the method slightly in view of what
Branden wrote on the list. Afmtodit does not add an extra column to the font
files it creates but adds the UTF-16 code as a comment. The groff_font man
page has been amended appropriately. The grops standard fonts have also had
the comment added, which corrects the regression.

Okay.  I acknowledge I haven't yet reached a conclusion about what means of
mapping _groff_ special characters back to Unicode seems best to me here.  We
_have_ glyphuni.cpp, and it seems an awful damn shame not to leverage it.  At
the same time, Perl/C++ interfacing doesn't seem like something we should just
leap into.

Maybe we can write a simple/stupid executable that links with _libgroff_,
dumps the "glyphuni" map to some well structured text file (or even a valid
Perl hash), and _gropdf_ could read that when it starts up.  Or even have it
embedded as part of the build.  We already have something in the source tree
that is similar; I just can't remember what it is right now.

> I am enjoying my sabbatical away from groff, but I thought I should correct
these recent changes, since they would cause issues if released, but I would
be grateful if other major changes to pdf production could wait until I am
back, fighting fit to do any QA.

My plan had been to finish merging the features you had broken out into that
HTML dependency graph, which I keep handy at all times on my tablet, call
those the groovy new things for gropdf in 1.24, and then leave it alone.

I say merging the features rather than merging the code because I'm uneasy
with merging code I don't understand.  Since you've gone on sabbatical the
FLOSS community has gotten a lesson reminding us of the hazards of such
things.  Andreas Freund even got himself interviewed on NPR (the U.S. public
broadcaster) yesterday.  His next performance review at Microsoft should be
exciting for him.

> Of course, if you find any problems with the code, please let me know
through this bug report.

If Dave could help organize the list you've presented above into a punch list
of new/old (probably mostly new) tickets upon which this one can depend, that
would be helpful to me and likely others to direct efforts in getting the bugs
squashed.

I would add that I think it's important for us to get more regression tests
for _groff_'s PDF output.  If we have them, then when I screw something up,
I'm more likely to know it.  It's better to have my own builds tell me I've
f***ed up than to not find out until you, or a user with a damaged document
(or three more days of beard growth by the time _groff_ finishes rendering),
tells me.

Once again I appreciate your close attention to the changes I made.

Regards,
Branden


    _______________________________________________________

Reply to this item at:

  <https://savannah.gnu.org/bugs/?65585>

_______________________________________________
Message sent via Savannah
https://savannah.gnu.org/

[bug #65585] [gropdf] problems introduced by commit cd9fde325f

Reply via email to