Thanks to Karl and Ingo for the interesting and informative responses.

On Tue, Jul 22, 2025 at 05:02:42PM +0200, Ingo Schwarze wrote:
> Hello Karl, hello Nathan,
> 
> Karl Pettersson wrote on Tue, Jul 22, 2025 at 09:43:33AM +0200:
> > Nathan Carruth <nathan.carr...@cantab.net> writes:
> 
> >> I doubt there is such a tool.
> >>
> >> Having spent a fair amount of time looking into this problem (in the
> >> context of diffing PDF versions of mathematics papers/interviews), my
> >> take is that such a tool would require running a tree-based diffing
> >> algorithm on the internal structure of the PDFs.
> 
> The good news is that mandoc(1) essentially already contains the
> tree-based algorithms you are talking about: mandoc -T pdf generates
> the PDF output document from the syntax tree you are talking about.
> The bad news is that in the present context, we want to test the
> PDF formatter (i.e., test how the term_ps.c module converts the tree
> to an actual PDF file).  Testing the tree would completely miss the
> purpose of a mandoc-to-PDF test suite, and lots of tests for the
> tree-based structure are already in place.

Makes sense. My expression 'tree-based diffing on the internal
structure' was too vague, and not necessarily accurate. I give a more
concrete suggestion towards the end of this email.

> 
> >> Even then, the complexity of the PDF format makes any general
> >> comparison very tricky.
> 

> The good news is that there is absolutely no need to bother with
> even, say, 5% of the PDF standard.  Mandoc is extremely selective
> in its use of PDF features - so testing mandoc output can easily
> avoid the complexity problem you are mentioning.

I had a look at the source code for mandoc and came away extremely
impressed: hand-crafted PDF output from scratch!

> 
> >> The only way I can see around this would be to internally reflow
> >> the body text -- which might require heuristics to strip headers and
> >> footers -- into an unpaginated format before computing the difference.
> 
> Nice idea, but testing the positioning of the text on the page is
> among the key purposes of testing PDF output.  For example, a paragraph
> should not suddenly jump right or down by five inches.  *That's*
> one example of a regression such a test suite should catch, and
> reflowing the text would likely lose that information.

This makes sense also.

Let me know if I'm just adding noise at this point, but the following
script prints the distinct differences between the vertical positions
of successive text tokens. That allows for checking changes in line
spacing, for example, and could also catch some cases where multiple
blank lines were inserted incorrectly. One could also check horizontal
positioning by extracting the first operand of the Td operator and
finding the minima, for example.

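#!/bin/sh
# Reads the PDF on standard input and prints the distinct differences
# between the vertical positions of successive text blocks, in
# thousandths of a PDF unit (assuming three decimal places throughout).
last=0
# For each BT block, print the second (vertical) operand of the first
# Td line, then move on to the next block.
sed -n -e '/^BT/ {
:loop
n
s/^[0-9.]* \([0-9.]*\) Td$/\1/p
t
s/ET//
t
bloop
} ' | { while read -r offset; do
        # Drop the decimal point so the shell can do integer arithmetic.
        offset=$(printf "%s\n" "$offset" | tr -d '.')
        [ "$last" -eq 0 ] && { last=$offset; continue; }
        [ "$offset" -eq "$last" ] && continue
        printf "%d\n" "$((offset-last))"
        last=$offset
done; } | sort -n -u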
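
The horizontal check mentioned above might look roughly like the
following. This is only a sketch: the function name min_td_x is made up
for illustration, and it assumes that every positioning line has exactly
the form "x y Td", as the sed pattern in the script above does.

```shell
#!/bin/sh
# Sketch: print the smallest horizontal (first) operand of any
# "x y Td" line read from standard input, with three decimals.
# min_td_x is a hypothetical helper name, not part of mandoc.
min_td_x() {
        awk 'NF == 3 && $3 == "Td" {
                x = $1 + 0
                if (!seen || x < min) { min = x; seen = 1 }
        }
        END { if (seen) printf "%.3f\n", min }'
}
```

If the smallest value differs between two versions of the output, the
left edge of the text has moved. Prints nothing when no Td line is found.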

From what I see of mandoc's PDF output, it is a stream of text tokens
(mostly words) wrapped in BT/ET blocks which also contain positioning
and font information. One could thus consider this heuristic: (a) split
header/footer from text body (we know the margins); (b) extract the list
of text tokens and use some standard diff algorithm to check for
added/deleted tokens; (c) for tokens which are matched across the two
versions, report differences in font and position. To avoid noise, one
could report changes in position only when relative position with
respect to *both* the previous and next token changed, for example. That
should allow one to avoid seeing changes due solely to differences in
line or page breaks, and in case an entire paragraph was moved
wholesale, it would report only one change, namely for the first word.
(I am assuming the output isn't justified so word spacing is constant?)
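
As a starting point for steps (b) and (c), the tokens could first be
flattened to one line each, carrying position and font along with the
text. The following is a rough sketch only, not a tested tool: the name
pdf_tokens is made up, and it assumes each token appears as a line "BT",
a font line ending in "Tf", a line "x y Td", a line "(text) Tj", and a
line "ET", which is my reading of the output and may not hold in all
cases.

```shell
#!/bin/sh
# Sketch: turn BT/ET blocks read from standard input into lines of
# the form  x <TAB> y <TAB> font <TAB> text.
# pdf_tokens is a hypothetical helper name; the block layout described
# in the paragraph before this code is an assumption.
pdf_tokens() {
        awk '
        /^BT$/ { inblock = 1; font = ""; x = ""; y = ""; next }
        inblock && $NF == "Tf" { font = $1 " " $2; next }
        inblock && NF == 3 && $3 == "Td" { x = $1; y = $2; next }
        inblock && / Tj$/ {
                text = $0
                # Strip the enclosing "(" and ") Tj" from the string.
                sub(/^\(/, "", text)
                sub(/\) Tj$/, "", text)
                printf "%s\t%s\t%s\t%s\n", x, y, font, text
                next
        }
        /^ET$/ { inblock = 0 }
        '
}
```

The fourth column of two such listings could then be compared with
diff(1) to find added and deleted tokens, and the first three columns
compared for the tokens that match.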

Unfortunately I'm not in a position to offer to implement such a
thing...

> 
> > Some of these problems could probably be mitigated by some solution
> > using tagged PDFs, so one can compare structural units other than pages.
> > Cf. 
> > https://pandoc.org/MANUAL.html#accessible-pdfs-and-pdf-archiving-standards

Thanks to Karl for reminding me about tagged PDF. Using a tagged PDF
obtained from tree-structured input seems to open the way to various
further processing possibilities, at least in my (unrelated) use case.

> 
> Anyway, thanks for making me aware of PDF/UA!
> 
> Yours,
>   Ingo
> 

Nathan
