On Fri, Jul 18, 2025 at 06:06:55PM +0200, Ingo Schwarze wrote:
> Hello Jan,
>
> Jan Stary wrote on Fri, Jul 18, 2025 at 03:22:44PM +0200:
>
> > It seems the ps/pdf output wraps a line of text
> > after each word. This happens with every manpage.
>
> Oops. Fixed with the commit appended below.
>
> I was so scrupulous about testing ASCII, UTF-8, and HTML output
> that i totally forgot sufficiently testing PostScript and PDF.
>
> If anyone has an idea how to regress/ test PostScript or PDF
> output, i'd be interested in hearing about it. Diffing complete
> output files is not an option, and even diffing *parts* of
> output files (like it is done for HTML) isn't either.
> That would be over-testing because PostScript and PDF
> files contain so many gory details that can change without
> the output becoming wrong, and that are actually likely to
> change as a result of minor code changes, so we would
> constantly get massive churn in the test suite.
> Also, whatever is done needs to work with tools that are
> available in the base system.

I doubt there is such a tool. Having spent a fair amount of time
looking into this problem (in the context of diffing PDF versions of
mathematics papers/interviews), my take is that such a tool would
require running a tree-based diffing algorithm on the internal
structure of the PDFs. Even then, the complexity of the PDF format
makes any general comparison very tricky.

More pragmatically, in my experience diffing PDFs also runs into
issues with the page-based structure of PDF. For example, suppose I
have versions v1 and v2, and v2 adds a line in the middle of p. 1.
Then the last line of v1p1 becomes the first line of v2p2, etc., and
(almost) _every succeeding page_ of the file differs in two lines, one
at the top and one at the bottom. The more that is added, the worse it
gets. The only way I can see around this would be to internally reflow
the body text -- which might require heuristics to strip headers and
footers -- into an unpaginated format before computing the difference.
(After writing this paragraph I remembered you mentioned using
pdftotext, so perhaps you already have some method for dealing with
headers/footers and changes in pagination?)

Nathan

>
> In any case, since PostScript and PDF are not really the focus
> of mandoc(1) development, it is appreciated that some people
> appear to keep an eye on it. Thanks. :-)
>
> Yours,
> Ingo
>
>
> CVSROOT: /cvs
> Module name: src
> Changes by: schwa...@cvs.openbsd.org 2025/07/18 09:46:58
>
> Modified files:
> usr.bin/mandoc : term_ps.c
>
> Log message:
> Adjust viscol (the distance in basic units from the column offset)
> and minbl (the minimum whitespace in basic units before the next column)
> in ps_advance() and ps_endline() because that is what term.c now expects.
> Regression reported by Jan Stary <hans at stare dot cz> on misc@.
>
> Also adjust ps_hspan() to the new definition of basic units
> in terminal output.
>
>
> Index: term_ps.c
> ===================================================================
> RCS file: /cvs/src/usr.bin/mandoc/term_ps.c,v
> diff -u -p -r1.57 term_ps.c
> --- term_ps.c	16 Jul 2025 14:23:55 -0000	1.57
> +++ term_ps.c	18 Jul 2025 15:44:36 -0000
> @@ -1207,6 +1207,7 @@ ps_advance(struct termp *p, size_t len)
>  	ps_plast(p);
>  	ps_pclose(p);
>  	p->ps->pscol += len;
> +	p->viscol += len;
>  }
>  
>  static void
> @@ -1230,6 +1231,8 @@ ps_endline(struct termp *p)
>  	/* Left-justify. */
>  
>  	p->ps->pscol = p->ps->left;
> +	p->viscol = 0;
> +	p->minbl = 0;
>  
>  	/* If we haven't printed anything, return. */
>  
> @@ -1307,7 +1310,7 @@ ps_hspan(const struct termp *p, const st
>  		 * scaling unit so that output is the same regardless
>  		 * the media.
>  		 */
> -		r = PNT2AFM(p, su->scale * 72.0 / 240.0);
> +		r = PNT2AFM(p, su->scale * 72.0 / 10.0);
>  		break;
>  	case SCALE_CM:
>  		r = PNT2AFM(p, su->scale * 72.0 / 2.54);
> @@ -1340,8 +1343,7 @@ ps_hspan(const struct termp *p, const st
>  		r = su->scale;
>  		break;
>  	}
> -
> -	return r * 24.0;
> +	return r;
>  }
>  
>  static void
>
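
For what it is worth, the reflow step described in the reply above can
be sketched in a few lines of C. This is only an illustration under
stated assumptions, not an existing tool and not part of mandoc: the
function name reflow() and its parameters are hypothetical, the input
is assumed to be plain text in which pages are separated by form feeds
(as in pdftotext output), and the header/footer "heuristic" is simply a
fixed number of lines dropped from the top and bottom of every page.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Return a pointer just past the line starting at "line". */
static const char *
line_end(const char *line, const char *end)
{
	const char *nl = memchr(line, '\n', (size_t)(end - line));

	return nl == NULL ? end : nl + 1;
}

/*
 * Hypothetical helper: copy "in" to "out" as unpaginated text.
 * Form feeds ('\f') are treated as page separators, and the first
 * "head" and last "foot" lines of every page are dropped.
 * Returns the number of bytes written (not counting the NUL),
 * or (size_t)-1 if "out" is too small.
 */
size_t
reflow(const char *in, size_t head, size_t foot, char *out, size_t outsz)
{
	const char *page, *end, *line, *next;
	size_t nlines, i, len, need, used = 0;

	assert(in != NULL && out != NULL && outsz > 0);

	for (page = in; *page != '\0'; page = end + (*end == '\f')) {
		/* A page runs up to the next form feed or the end. */
		end = page + strcspn(page, "\f");

		/* First pass: count the lines on this page. */
		nlines = 0;
		for (line = page; line < end; line = line_end(line, end))
			nlines++;

		/* Second pass: copy everything except header/footer. */
		i = 0;
		for (line = page; line < end; line = next, i++) {
			next = line_end(line, end);
			if (i < head || i + foot >= nlines)
				continue;
			len = (size_t)(next - line);
			/* Terminate a page-final line lacking a newline. */
			need = len + (line[len - 1] != '\n');
			if (used + need >= outsz)
				return (size_t)-1;
			memcpy(out + used, line, len);
			if (need > len)
				out[used + len] = '\n';
			used += need;
		}
	}
	out[used] = '\0';
	return used;
}
```

Running both versions of a document through such a filter and then
through diff(1) should report only the inserted line rather than two
changed lines per page. Fixed line counts are the crudest possible
choice; running headers that vary from page to page, or footnotes,
would need a smarter rule.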