Nathan Carruth <nathan.carr...@cantab.net> writes:

>
> I doubt there is such a tool.
>
> Having spent a fair amount of time looking into this problem (in the
> context of diffing PDF versions of mathematics papers/interviews), my
> take is that such a tool would require running a tree-based diffing
> algorithm on the internal structure of the PDFs. Even then, the
> complexity of the PDF format makes any general comparison very tricky.
>
> More pragmatically, in my experience diffing PDFs also runs into issues
> with the page-based structure of PDF. For example, suppose I have
> versions v1 and v2, and v2 adds a line in the middle of p. 1. Then the
> last line of v1p1 becomes the first line of v2p2, etc., and (almost)
> _every succeeding page_ of the file lists two different lines, one at
> the top and one at the bottom. The more that is added, the worse it
> gets. The only way I can see around this would be to internally reflow
> the body text -- which might require heuristics to strip headers and
> footers -- into an unpaginated format before computing the difference.
>
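
The reflow idea can be sketched roughly as follows. This is only a
minimal illustration, not an existing tool: it assumes the text of each
version has already been extracted with pages separated by form feeds
(as, to my knowledge, pdftotext does by default), and the names reflow
and diff_reflowed are made up for this example. The header/footer
heuristic (drop a first or last line that recurs on more than half of
the pages, ignoring digits) is likewise just an assumption.

```python
import difflib
import re
from collections import Counter


def _key(line):
    # Normalise digits so "Page 3" and "Page 4" count as the same footer.
    return re.sub(r"\d+", "#", line.strip())


def reflow(text):
    """Turn form-feed-separated page text into one unpaginated line list."""
    pages = [[ln for ln in page.splitlines() if ln.strip()]
             for page in text.split("\f")]
    pages = [page for page in pages if page]
    # Heuristic (an assumption): a first or last line that recurs on more
    # than half of the pages is treated as a header or footer.
    tops = Counter(_key(page[0]) for page in pages)
    bottoms = Counter(_key(page[-1]) for page in pages)
    limit = len(pages) / 2
    body = []
    for page in pages:
        if len(pages) > 1 and tops[_key(page[0])] > limit:
            page = page[1:]
        if page and len(pages) > 1 and bottoms[_key(page[-1])] > limit:
            page = page[:-1]
        body.extend(ln.strip() for ln in page)
    return body


def diff_reflowed(old_text, new_text):
    """Unified diff of the two versions after removing pagination."""
    return list(difflib.unified_diff(reflow(old_text), reflow(new_text),
                                     "v1", "v2", lineterm="", n=0))
```

With this, a line added in the middle of p. 1 should show up as a single
insertion instead of a change at the top and bottom of every following
page. Real PDFs would need more than this: hyphenation, multi-line
headers, footnotes and multi-column layouts are all ignored here.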

Some of these problems could probably be mitigated by using tagged
PDFs, so that one can compare structural units other than pages.
But I guess that would require large changes to the PDF engine in mandoc.

Cf. https://pandoc.org/MANUAL.html#accessible-pdfs-and-pdf-archiving-standards

Best
-- 
Karl Pettersson
Uppsala, Sverige/Sweden

https://static-dust.klpn.se/
