Hello Karl, hello Nathan,

Karl Pettersson wrote on Tue, Jul 22, 2025 at 09:43:33AM +0200:
> Nathan Carruth <nathan.carr...@cantab.net> writes:

>> I doubt there is such a tool.
>>
>> Having spent a fair amount of time looking into this problem (in the
>> context of diffing PDF versions of mathematics papers/interviews), my
>> take is that such a tool would require running a tree-based diffing
>> algorithm on the internal structure of the PDFs.

The good news is that mandoc(1) essentially already contains the
tree-based algorithms you are talking about: mandoc -T pdf generates
the PDF output document from the syntax tree you are talking about.
The bad news is that in the present context, we want to test the
PDF formatter (i.e., test how the term_ps.c module converts the tree
to an actual PDF file).  Testing the tree would completely miss the
purpose of a mandoc-to-PDF test suite, and lots of tests for the
tree-based structure are already in place.

>> Even then, the complexity of the PDF format makes any general
>> comparison very tricky.

Indeed, the PDF standard is among the worst standards i have ever
seen, let me call it a Challenger-Deep-sized kitchen sink.
Even other extremely bad standards like XML/XSLT hardly come close
to how bad PDF is.  Modern HTML/CSS is not exactly pretty either,
so the fact that it's a walk in the park compared to PDF tells a
lot about how horrifying PDF really is.

The good news is that there is absolutely no need to bother with
even, say, 5% of the PDF standard.  Mandoc is extremely selective
in its use of PDF features - so testing mandoc output can easily
avoid the complexity problem you are mentioning.

>> More pragmatically, in my experience diffing PDFs also runs into issues
>> with the page-based structure of PDF. For example, suppose I have
>> versions v1 and v2, and v2 adds a line in the middle of p. 1. Then the
>> last line of v1p1 becomes the first line of v2p2, etc., and (almost)
>> _every succeeding page_ of the file lists two different lines, one at
>> the top and one at the bottom. The more that is added, the worse it
>> gets.

Actually, the same problem already exists with line breaking, similar
to what you describe for page breaking.  Add one word in the middle
of a line, and the subsequent line break will likely move.
Do not even change any word, merely tweak kerning a bit, and the
line break may already flip somewhere else.
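
To illustrate the cascade, here is a minimal sketch in Python.  It
assumes plain greedy line filling on character counts, which is a
deliberate simplification and not how mandoc measures glyph widths;
the names fill() and changed_lines() are hypothetical.

```python
def fill(words, width):
    """Greedy line filling: put as many words on each line as fit."""
    lines, current = [], ""
    for word in words:
        candidate = word if not current else current + " " + word
        # An overlong word still gets a line of its own.
        if len(candidate) <= width or not current:
            current = candidate
        else:
            lines.append(current)
            current = word
    if current:
        lines.append(current)
    return lines


def changed_lines(old, new):
    """Count line positions at which the two line lists differ."""
    longest = max(len(old), len(new))
    old = old + [""] * (longest - len(old))
    new = new + [""] * (longest - len(new))
    return sum(a != b for a, b in zip(old, new))


before = fill("aa bb cc dd ee ff gg hh".split(), 5)
after = fill("aa xx bb cc dd ee ff gg hh".split(), 5)
print(changed_lines(before, after))  # prints 5
```

One inserted word changes every line of the output, including a new
fifth line, so a naive line-by-line comparison reports the whole
paragraph as different.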

>> The only way I can see around this would be to internally reflow
>> the body text -- which might require heuristics to strip headers and
>> footers -- into an unpaginated format before computing the difference.

Nice idea, but testing the positioning of the text on the page is
among the key purposes of testing PDF output.  For example, a paragraph
should not suddenly jump right or down by five inches.  *That's*
one example of a regression such a test suite should catch, and
reflowing the text would likely lose that information.
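
A check of that kind could look roughly like the following Python
sketch.  It rests on assumptions that would need verifying against
real output: that the page content streams are uncompressed, and that
each text run sits in its own BT ... ET object with a single
"x y Td" followed by "(text) Tj", so that the Td operands are
effectively absolute positions in points.  Td and Tj are standard PDF
operators; the function names and the sample streams are made up for
illustration.

```python
import re

# "x y Td" followed by "(text) Tj"; backslash escapes inside the
# string are tolerated, nested unescaped parentheses are not.
TEXT_OP = re.compile(
    r"(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s+Td\s*"
    r"\(((?:[^()\\]|\\.)*)\)\s*Tj")


def text_positions(stream):
    """Return (text, x, y) for every text run found in the stream."""
    return [(m.group(3), float(m.group(1)), float(m.group(2)))
            for m in TEXT_OP.finditer(stream)]


def moved(expected, actual, tolerance=0.5):
    """List runs whose position shifted by more than tolerance points.

    Runs are paired by order of appearance; surplus runs on either
    side are ignored in this sketch.
    """
    report = []
    pairs = zip(text_positions(expected), text_positions(actual))
    for (text, x0, y0), (_, x1, y1) in pairs:
        dx, dy = x1 - x0, y1 - y0
        if abs(dx) > tolerance or abs(dy) > tolerance:
            report.append((text, dx, dy))
    return report


# Hypothetical content streams; the second run jumps right by
# 360 points, i.e. five inches.
old = ("BT\n/F0 11 Tf\n72.000 700.000 Td\n(NAME) Tj\nET\n"
       "BT\n/F0 11 Tf\n108.000 686.000 Td\n(mandoc) Tj\nET\n")
new = old.replace("108.000", "468.000")
print(moved(old, new))  # prints [('mandoc', 360.0, 0.0)]
```

Because the comparison works on coordinates rather than on reflowed
text, it keeps exactly the positioning information that reflowing
would throw away.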

> Some of these problems could probably be mitigated by some solution
> using tagged PDFs, so one can compare structural units other than pages.
> Cf. https://pandoc.org/MANUAL.html#accessible-pdfs-and-pdf-archiving-standards

Ouch.
I detest pandoc and consider it garbage software of very bad quality.

I wasn't aware of the concept of tagged PDF though.
It looks like some real information about what that is is
available here:

  https://pdfa.org/resource/iso-14289-pdfua/

They make you jump through hoops to even get the standard
(accessibility my ass!), and i didn't look at the syntax specification
yet to see whether they designed a good language.  But if it delivers
what it promises on https://pdfa.org/resource/iso-14289-pdfua/ (that
page is just a marketing blurb), that surely sounds enticing.

  https://en.wikipedia.org/wiki/PDF/UA
  https://en.wikipedia.org/wiki/PDF/A

provide a very high-level introduction to the fundamental ideas.

Be careful - searching for the topic with web search engines does
not work well because the web overflows with garbage tutorials, so
you get a negligible signal-to-noise ratio.

> But I guess that would require large changes to the PDF engine in mandoc.

I'm not convinced it would.  The foundation of the PDF engine in mandoc
already is a syntax tree with the vast majority of the nodes having
semantic value.  Exploiting that semantic information might be quite
feasible when done right and quite possibly worthwhile.

The reason why mandoc(1) -T html output is so much better than grohtml(1)
is because mandoc uses semantics as *the* central structuring principle
of both its user interface and its code.  Exploiting the same strength
for PDF sounds very attractive to me.

Whether that will help with regress/ testing is another question though.
It may turn out that the most important features for the PDF suite to
test are the parts that are *not* semantic but purely presentational...

Anyway, thanks for making me aware of PDF/UA!

Yours,
  Ingo
