Update of bug #65585 (group groff): Summary: Problems introduced by commit cd9fde325f => [gropdf] problems introduced by commit cd9fde325f
_______________________________________________________ Follow-up Comment #1: Hi Deri, Thanks for giving my changes a close look. I'm obviously pretty inexperienced with this part of the _groff_ code base. [comment #0 original submission:] > I have been looking at the current state of pdf.tmac and have found a few issues. The changes which have been committed are:- > > Introduction of a new flag, -S, which replaces my change which introduced the concept of passing a single pipe character as a hotspot meant that the actual text to form the actual hotspot would be "piped" to the document stream, terminated by doing a markstop. This is what allowed Branden to implement the man macros which place contents into a diversion before emitting a hotspot. I have no objection to this change, although there were several ways of setting up a hotspot containing a single pipe character before this change, and most users are probably familiar with piping data. (Perhaps I should have used "<" as the single character). I would be very annoyed if someone sent me a pdf with a single "|" as a hotspot - would take me about 5 minutes to click it with the mouse! I don't object to the _mnemonic_ of "|" here; what made me uneasy was the use of in-band signaling to the macro file when faced with arbitrary user input. Someone somewhere would want to make a lone "|" hotspot and would, I predict, be flabbergasted that they weren't allowed. Yes, we could advise them to work around the restriction by sticking a dummy character (or similar) adjacent to it, but that doesn't feel to me like advice we could be proud of. Also I don't think "pdf.tmac" is the right layer at which to have _groff_ warn the user about accessibility flaws. Maybe there isn't such a layer, apart from excoriating the document author. And if someone set a "|" in 36- or 48-point type, presumably it wouldn't be as much of a problem. So a preclusion of it (unadorned) as a hotspot seems to me a bit of a severe measure. > Another change was an attempt to only allow a restart after a preceding suspend. This is a good idea, but it is in the wrong place, it should be implemented in gropdf rather than pdf.tmac, since a user may use \X'pdf: markrestart' as documented in the gropdf man page. Which bypasses the check in pdf.tmac. I'll need to understand this better, but I admit that it did feel to me like only half of a fix. My instinct is that it would be better to expose "markstart" and "markend" directly with convenience macros, and update _gropdf_(1) such that users could understand what they were for and how they might write their own hyperlinking macros. I'm pretty uncomfortable with "pdfmark.tmac"'s approach, which assumes that it has anticipated every user's every need (but also stops documenting its all-singing, all-dancing comprehensiveness in section 4 of its manual), and all you need to do to use it is write a macro call resembling an _ffmpeg_ command to get results. There's even a place for that--but not, I think, via the means of going straight to device control commands. I think we should have a more granular API for access to PDF features, and I think your approach in "pdf.tmac" has taken several steps down that path already. > Branden has implemented a looping construct to hold tags replacing an associative array. It is not used if the user is using mom macros (it does not work for mom documents). I'm aware of it and it's on my to-do list. I [https://lists.gnu.org/archive/html/groff/2024-03/msg00022.html spoke of it on the mailing list in March]. Long story short, I plan to get my new approach working for _mom_(7) but that means I have to solve the danged old "node embedding in \X commands" problem that has vexed us for a while. > In fact it only works reliably for the man macros producing a "book" of man pages, where it is used to differenciate between an internal link to another page in the book or a "man:" URL to an external man page. The problem is that the new code does not implement:- > .pdfhref L -D name > Which should emit the descriptive text associated with the named tag at the time it was defined, but instead it just emits the "name". This is documented in pdfmark.pdf, and used several times in pdfmark.ms, so it is reasonable to assume users may be using this in their own documents if they use "pdfmom --roff -mspdf". So it is wrong to assume the new code is suitable for everything except mom. Okay. That's a defect/implementation gap I was unaware of. It might be a good idea to spawn a ticket off for that and have this one depend on it. > There is also an issue with the speed of the new code. One test file I used went from an elapsed time of 0m3.08s to 11m31.62s. (About 670 times slower!). This was producing a large document (LinuxManBook) from a single file, using the command:- > time ~/groff-git/groff/test-pdfmom --roff -Tpdf -mandoc -petk LinuxManBook.trf -z > (test-pdfmom is the same as test-groff but calling the pdfmom in the build directory). I don't know whether this slow down in groff will be acceptable to most users, That's freaking awful and I agree that it's unacceptable for release. I am wondering why Alex Colomar hasn't rung my bell about it. Maybe he hasn't tried _groff_ Git HEAD lately. He did complain that it was tedious to work with unreleased _groff_ versions, and asked for a 1.24 ASAP. I demurred, as I think there's still a solid chunk of work to be done--a few months' worth. > I'm meant to be on a sabatical so I am not going to argue either way, but be aware that is Branden's attempt to solve one issue which was solved months ago in my branch waiting for merging. My use of stringhex was described as "obfuscation", which is rather insulting if the intention was to imply a deliberate attempt to obscure the purpose of my code. That certainly was not my intention. I meant simply that the stringhexed string did not have obvious meaning when dumping device-independent or approximated output (with "groff -Z" or "groff -a"). I want these output formats to be accessible and comprehensible to users so that they feel more competent to troubleshoot problems (and send us better-informed bug reports!). > The correct interpretation of its use is to protect the data from interpretation by groff so that any byte sequence can be used as a string name in groff. This is analagous to using base64 to protect binary data in emails. I understand. But by the same token, no human reads base64. They decode it first. But we don't have an "unstringhex" command for performing this transformation on "groff -Z" or "groff -a" output, and I'd suggest that we shouldn't need one. Admittedly, any accessibility my approach gains is going to be traded away again for documents that don't predominantly use the Latin alphabet. On net I think my approach makes groff -[aZ] output more accessible for _many_ users and is no worse for the rest. > Although Branden's loopy code solves some issues with 1.23.0 it fails in a number of ways which are dealt with successfully by the pdf.tmac in my branch (i.e. t.trf in #64576). Okay, I will have to take a look at that when my availability recovers. (Since you're not on the _groff_ list for a while, you may not know that I had a death in the immediate family this week and, in an employment context, I'd be on bereavement leave right now.) > There is a bug in Branden's code. The attached file, pdf-L.trf, illustrates the issue. According to the documentation in pdfmark.pdf, if you use .pdfhref L with a destination name but no descriptive text, the descriptive text given when the destination is named is used. With Branden's code, instead of using the descriptive text it uses the name of the destination instead. The two pdfs called pdf-L illustrate the problem. Okay. Thanks for supplying reproducers. This should be looked into. > Another bug results in entries in the array Branden loops over get over-vritten in certain circumstances. This code illustrates the bug. > .ig > groff -Tpdf -dPDF.EXPORT=1 -z pdf-M.trf > > Results:- > > .ds pdf:bm1.tag one > .ds pdf:bm1.tag two > > (The bm1.tag has been overwritten!!!) Hmm, yes, that does sound incorrect. > But, > > groff -Tpdf -dPDF.EXPORT=1 -dPRINTSTYLE=1 -z pdf-M.trf > > Results:- > > .ds pdf:look(one) Once upon... > .ds pdf:look(two) This is two > > .. > .pdfbookmark -T one 1 Once upon... > .pdfhref M -D two -E This is two > Setting "PRINTSTYLE=1" bypasses Branden's changes, because he uses the original code if mom macros are used. I don't understand why it is desirable to have two separate methods. It isn't desirable. Supporting both was a temporary measure to avoid regressing _mom_ documents. > I have fixed all the problems listed above, plus a speed up of Branden's loopy code. I have also included Keith's clever solution to polluting the pdf flags with data which can cause the errors about illegal characters in identifier names if the optional "--" marker not used. I would like to review that; Keith and I seem frequently to disagree. > There is a minor regression in this code, not caught by any of the check tests, but when producing our pdf documents you will notice these new warnings:- > gropdf:man/groff.7: warning: Can't convert 'rs' to unicode > gropdf:../contrib/mom/examples/sample_docs.mom: warning: Can't convert 'em' to unicode I have not seen any of these warnings in any builds I have done at any point...oh, you mean these pop up with your new code? > They are caused because gropdf currently has no way of knowing what the UTF-16 for these groff characters. Previously the "em" character was handled by a:- > .tr \[em]- > In pdf.tmac. I was never happy with this "fudge" and the correct handling was part of my branch. I have changed the method slightly in view of what Branden wrote on the list. Afmtodit does not add an extra column to the font files it creates but adds the UTF-16 code as a comment. The groff_font man page has been amended appropriately. The grops standard fonts have also had the comment added, which corrects the regression. Okay. I acknowledge I haven't yet reached a conclusion about what means of mapping _groff_ special characters back to Unicode seems best to me here. We _have_ glyphuni.cpp, and it seems an awful damn shame not to leverage it. At the same time, Perl/C++ interfacing doesn't seem like something we should just leap into. Maybe we can write a simple/stupid executable that links with _libgroff_, dumps the "glyphuni" map to some well structured text file (or even a valid Perl hash), and _gropdf_ could read that when it starts up. Or even have it embedded as part of the build. We already have something in the source tree that is similar; I just can't remember what it is right now. > I am enjoying my sabbatical away from groff, but I thought I should correct these recent changes, since they would cause issues if released, but I would be grateful if other major changes to pdf production could wait until I am back, fighting fit to do any QA. My plan had been to finish merging the features you had broken out into that HTML dependency graph, which I keep handy at all times on my tablet, call those the groovy new things for gropdf in 1.24, and then leave it alone. I say merging the features rather than merging the code because I'm uneasy with merging code I don't understand. Since you've gone on sabbatical the FLOSS community has gotten a lesson reminding us of the hazards of such things. Andreas Freund even got himself interviewed on NPR (the U.S. public broadcaster) yesterday. His next performance review at Microsoft should be exciting for him. > Of course, if you find any problems with the code, please let me know through this bug report. If Dave could help organize the list you've presented above into a punch list of new/old (probably mostly new) tickets upon which this one can depend, that would be helpful to me and likely others to direct efforts in getting the bugs squashed. I would add that I think it's important for us to get more regression tests for _groff_'s PDF output. If we have them, then when I screw something up, I'm more likely to know it. It's better to have my own builds tell me I've f***ed up than to not find out until you, or a user with a damaged document (or three more days of beard growth by the time _groff_ finishes rendering), tells me. Once again I appreciate your close attention to the changes I made. Regards, Branden _______________________________________________________ Reply to this item at: <https://savannah.gnu.org/bugs/?65585> _______________________________________________ Message sent via Savannah https://savannah.gnu.org/