rlevesque wrote:

> Hi
>
> I am working on a program that generates various pdf files in the
> /results folder.
>
> "scenario1.pdf" results from scenario1
> "scenario2.pdf" results from scenario2
> etc
>
> Once I am happy with scenario1.pdf and scenario2.pdf files, I would
> like to save them in the /check folder.
>
> Now after having developed/modified the program to produce
> scenario3.pdf, I would like to be able to re-generate files
> /results/scenario1.pdf
> /results/scenario2.pdf
>
> and compare them with
> /check/scenario1.pdf
> /check/scenario2.pdf
>
> I tried using the md5 module to compare these files but md5 reports
> differences even though the code has *not* changed at all.
>
> Is there a way to compare 2 pdf files generated at different time but
> identical in every other respect and validate by program that the
> files are identical (for all practical purposes)?

Here's a naive approach, but it may be good enough for your purpose.
I've printed the same small text into 1.pdf and 2.pdf.

(Bad practice warning: this session is slightly doctored; I hope I haven't
introduced an error)

>>> a = open("1.pdf", "rb").read()
>>> b = open("2.pdf", "rb").read()
>>> diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
>>> len(diff)
2
>>> diff
[160, 161]
>>> a[150:170]
'0100724151412)\n>>\nen'
>>> a[140:170]
'nDate (D:20100724151412)\n>>\nen'
>>> a[130:170]
')\n/CreationDate (D:20100724151412)\n>>\nen'

So the two files differ only in two digits of the /CreationDate timestamp.
OK, let's ignore "lines" starting with "/CreationDate " for our custom
comparison function:

>>> from itertools import izip_longest
>>> def equal_pdf(fa, fb):
...     with open(fa, "rb") as a:
...         with open(fb, "rb") as b:
...             for la, lb in izip_longest(a, b, fillvalue=""):
...                 if la != lb:
...                     # a mismatch is tolerated only if both lines are /CreationDate lines
...                     if not la.startswith("/CreationDate "): return False
...                     if not lb.startswith("/CreationDate "): return False
...     return True
...
>>> equal_pdf("1.pdf", "2.pdf")
True

Peter
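
For readers on Python 3, the same idea can be sketched as follows. This is only a minimal sketch under the same assumption as the session above: the timestamp sits on its own line starting with "/CreationDate ". The name VOLATILE_PREFIXES and the `ignore` parameter are made up for this illustration and are not part of the original post. The files are opened in binary mode so that the comparison works on raw bytes.

```python
from itertools import zip_longest

# Line prefixes that are allowed to differ between two runs. Only
# "/CreationDate " comes from the session above; the tuple itself is a
# hypothetical extension point for other volatile fields.
VOLATILE_PREFIXES = (b"/CreationDate ",)


def equal_pdf(path_a, path_b, ignore=VOLATILE_PREFIXES):
    """Return True if the two files differ at most in lines that start
    with one of the prefixes in `ignore`."""
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        # zip_longest pads the shorter file with b"", so a file with
        # extra trailing lines is reported as different.
        for la, lb in zip_longest(fa, fb, fillvalue=b""):
            if la == lb:
                continue
            # A mismatch is tolerated only if both lines are volatile.
            if not (la.startswith(ignore) and lb.startswith(ignore)):
                return False
    return True
```

Calling equal_pdf("results/scenario1.pdf", "check/scenario1.pdf") would then stand in for the md5 comparison. Depending on the PDF generator, other fields (for example /ModDate or the trailer /ID) may also change from run to run, and a timestamp inside a compressed stream would not be caught by a line-based check like this one.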