Nathan Carruth <nathan.carr...@cantab.net> writes:

> > I doubt there is such a tool.
>
> Having spent a fair amount of time looking into this problem (in the
> context of diffing PDF versions of mathematics papers/interviews), my
> take is that such a tool would require running a tree-based diffing
> algorithm on the internal structure of the PDFs. Even then, the
> complexity of the PDF format makes any general comparison very tricky.
>
> More pragmatically, in my experience diffing PDFs also runs into issues
> with the page-based structure of PDF. For example, suppose I have
> versions v1 and v2, and v2 adds a line in the middle of p. 1. Then the
> last line of v1p1 becomes the first line of v2p2, etc., and (almost)
> _every succeeding page_ of the file lists two different lines, one at
> the top and one at the bottom. The more that is added, the worse it
> gets. The only way I can see around this would be to internally reflow
> the body text -- which might require heuristics to strip headers and
> footers -- into an unpaginated format before computing the difference.

Some of these problems could probably be mitigated by some solution
using tagged PDFs, so one can compare structural units other than
pages. But I guess that would require large changes to the PDF engine
in mandoc.

Cf. https://pandoc.org/MANUAL.html#accessible-pdfs-and-pdf-archiving-standards

Best
-- 
Karl Pettersson
Uppsala, Sverige/Sweden
https://static-dust.klpn.se/
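
PS. A rough sketch of the reflow idea from the quoted message, in case it
is useful. It assumes the text has already been extracted from each PDF
into a list of pages, each page being a list of lines; the extraction
step is not shown. The function names and the 60% threshold are made up
for illustration, and the header/footer heuristic (drop a first or last
line that recurs on most pages, ignoring digits) is only one guess at
what such a heuristic could look like.

```python
import difflib
import re
from collections import Counter


def _key(line):
    # Treat "Page 1" and "Page 2" as the same running line by masking digits.
    return re.sub(r"\d+", "#", line.strip())


def strip_running_lines(pages, min_fraction=0.6):
    """Drop first/last lines that recur on most pages (likely headers/footers)."""
    if len(pages) < 2:
        return [list(page) for page in pages]
    tops = Counter(_key(page[0]) for page in pages if page)
    bottoms = Counter(_key(page[-1]) for page in pages if page)
    threshold = min_fraction * len(pages)
    cleaned = []
    for page in pages:
        lines = list(page)
        if lines and tops[_key(lines[0])] >= threshold:
            lines = lines[1:]
        if lines and bottoms[_key(lines[-1])] >= threshold:
            lines = lines[:-1]
        cleaned.append(lines)
    return cleaned


def reflow(pages):
    """Join all pages into one unpaginated list of words."""
    return [
        word
        for page in strip_running_lines(pages)
        for line in page
        for word in line.split()
    ]


def diff_words(old_pages, new_pages):
    """Return the non-equal opcodes of a word-level diff of the reflowed text."""
    old, new = reflow(old_pages), reflow(new_pages)
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


v1 = [
    ["My Paper", "alpha beta", "gamma delta", "Page 1"],
    ["My Paper", "epsilon zeta", "eta theta", "Page 2"],
]
# v2 adds a line in the middle of p. 1, pushing later lines onto new pages.
v2 = [
    ["My Paper", "alpha beta", "NEW LINE", "Page 1"],
    ["My Paper", "gamma delta", "epsilon zeta", "Page 2"],
    ["My Paper", "eta theta", "Page 3"],
]
changes = diff_words(v1, v2)
print(changes)
```

With these toy pages, the diff reports only the inserted words and not a
changed line at the top and bottom of every later page. Real documents
would need a sturdier heuristic, for instance for multi-line headers or
for words hyphenated across line breaks.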