Hello Karl, hello Nathan,

Karl Pettersson wrote on Tue, Jul 22, 2025 at 09:43:33AM +0200:
> Nathan Carruth <nathan.carr...@cantab.net> writes:

>> I doubt there is such a tool.
>>
>> Having spent a fair amount of time looking into this problem (in the
>> context of diffing PDF versions of mathematics papers/interviews), my
>> take is that such a tool would require running a tree-based diffing
>> algorithm on the internal structure of the PDFs.

The good news is that mandoc(1) essentially already contains the
tree-based algorithms you are talking about: mandoc -T pdf generates
the PDF output document from exactly such a syntax tree.

The bad news is that in the present context, we want to test the PDF
formatter (i.e., test how the term_ps.c module converts the tree to
an actual PDF file).  Testing the tree would completely miss the
purpose of a mandoc-to-PDF test suite, and lots of tests for the
tree-based structure are already in place.

>> Even then, the complexity of the PDF format makes any general
>> comparison very tricky.

Indeed, the PDF standard is among the worst standards i have ever
seen, let me call it a Challenger-Deep-sized kitchen sink.  Even other
extremely bad standards like XML/XSLT hardly come close to how bad
PDF is.  Modern HTML/CSS is not exactly pretty either, so the fact
that it's a walk in the park compared to PDF tells a lot about how
horrifying PDF really is.

The good news is that there is absolutely no need to bother with
even, say, 5% of the PDF standard.  Mandoc is extremely selective in
its use of PDF features - so testing mandoc output can easily avoid
the complexity problem you are mentioning.

>> More pragmatically, in my experience diffing PDFs also runs into issues
>> with the page-based structure of PDF. For example, suppose I have
>> versions v1 and v2, and v2 adds a line in the middle of p. 1. Then the
>> last line of v1p1 becomes the first line of v2p2, etc., and (almost)
>> _every succeeding page_ of the file lists two different lines, one at
>> the top and one at the bottom. The more that is added, the worse it
>> gets.

Actually, the same problem already exists with line breaking, similar
to what you describe for page breaking.  Add one word in the middle
of a line, and the subsequent line break will likely move.  Do not
even change any word, merely tweak kerning a bit, and the line break
may already flip somewhere else.

>> The only way I can see around this would be to internally reflow
>> the body text -- which might require heuristics to strip headers and
>> footers -- into an unpaginated format before computing the difference.

Nice idea, but testing the positioning of the text on the page is
among the key purposes of testing PDF output.  For example, a
paragraph should not suddenly jump right or down by five inches.
*That's* one example of a regression such a test suite should catch,
and reflowing the text would likely lose that information.

> Some of these problems could probably be mitigated by some solution
> using tagged PDFs, so one can compare structural units other than pages.
> Cf. https://pandoc.org/MANUAL.html#accessible-pdfs-and-pdf-archiving-standards

Ouch.  I detest pandoc and consider it garbage software of very bad
quality.

I wasn't aware of the concept of tagged PDF though.  It looks like
some real information about what that is is available here:

  https://pdfa.org/resource/iso-14289-pdfua/

They make you jump through hoops to even get the standard
(accessibility my ass!), and i didn't look at the syntax specification
yet to see whether they designed a good language, but if it delivers
what it promises on

  https://pdfa.org/resource/iso-14289-pdfua/

(that page is just a marketing blurb), that surely sounds enticing.

  https://en.wikipedia.org/wiki/PDF/UA
  https://en.wikipedia.org/wiki/PDF/A

provide a very high-level introduction to the fundamental ideas.
Be careful - searching for the topic with web search engines does not
work well because the web overflows with garbage tutorials, so you
get a negligible signal-to-noise ratio.

> But I guess that would require large changes to the PDF engine in mandoc.

I'm not convinced it would.  The foundation of the PDF engine in
mandoc already is a syntax tree with the vast majority of the nodes
having semantic value.  Exploiting that semantic information might be
quite feasible when done right and quite possibly worthwhile.

The reason mandoc(1) -T html output is so much better than grohtml(1)
is that mandoc uses semantics as *the* central structuring principle
of both its user interface and its code.  Exploiting the same strength
for PDF sounds very attractive to me.

Whether that will help with regress/ testing is another question
though.  It may turn out that the most important features for the PDF
suite to test are the parts that are *not* semantic but purely
presentational...

Anyway, thanks for making me aware of PDF/UA!

Yours,
  Ingo