How to go about with PDF regression

Jon Reyes Sun, 17 Feb 2013 19:03:20 -0800

Hey there. I'm trying to set up automated regression testing for PDFs, using 
Selenium RC to generate them and Python to compare them. I plan to use pyPdf 
to rename the files according to their content, ImageMagick to convert the 
PDFs to images, and PIL to compare the rendered pages one by one. 
Here's the big pain in the neck, though: not every element in a PDF stays the 
same between runs, so the dynamic parts have to be left out of the comparison. 
Right now I find each region to exclude, paint a white box over it, and then 
compare. The problem is that I have to repeat this for every PDF whose layout 
deviates from the coordinates I've already collected, which will probably take 
a lot of time. Worse, I'll have to set up a config file (XML, say) or stuff a 
class full of constants, and pile on a lot of ifs so that the right elements 
get removed from each PDF.
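
A rough sketch of the masking and comparison step described above, assuming 
the pages have already been rendered to images and that Pillow (the maintained 
PIL fork) is installed. The layout names and box coordinates in 
DYNAMIC_REGIONS are made up for illustration, and mask_regions and pages_match 
are hypothetical helper names. The comparison is an exact pixel match, which 
may be too strict if rendering differs slightly between runs.

```python
from PIL import Image, ImageChops, ImageDraw

# Hypothetical layout names and boxes (left, top, right, bottom), in pixels
# of the rendered page image. Real values would have to be measured.
DYNAMIC_REGIONS = {
    "invoice": [(400, 20, 590, 60), (40, 700, 300, 730)],
    "statement": [(400, 20, 590, 60)],
}


def mask_regions(page, boxes):
    """Return a copy of the page with every box painted white."""
    masked = page.convert("RGB")  # convert() returns a new image
    draw = ImageDraw.Draw(masked)
    for box in boxes:
        draw.rectangle(box, fill="white")
    return masked


def pages_match(expected, actual, boxes):
    """Compare two page images pixel by pixel, ignoring the given boxes."""
    if expected.size != actual.size:
        return False
    diff = ImageChops.difference(
        mask_regions(expected, boxes), mask_regions(actual, boxes)
    )
    # getbbox() is None when every pixel of the difference image is black,
    # i.e. when the two masked pages are identical.
    return diff.getbbox() is None
```

Keeping the boxes in one dictionary keyed by layout would at least replace the 
chain of ifs with a lookup such as DYNAMIC_REGIONS["invoice"].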


Has anyone done PDF regression testing in Python like this before, and have 
you found a better way to do it?

I was thinking that if there were a tool where I could open the image, draw 
boxes with the mouse, and have it generate the box coordinates automatically, 
my life would be a tad easier. But nope, I don't think there is one. 
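
A tool like that could be fairly small. The sketch below is only a rough 
outline, assuming Python 3's tkinter module, a page already saved as a PNG, 
and Tk 8.6 or newer (needed for PhotoImage to read PNG files). The names 
normalize_box and select_boxes are hypothetical. The image is shown at full 
size, so canvas coordinates are assumed to equal image pixel coordinates.

```python
def normalize_box(x0, y0, x1, y1):
    """Order drag start/end points into (left, top, right, bottom)."""
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def select_boxes(image_path):
    """Show the image, let the user drag boxes, and return their coordinates."""
    import tkinter as tk  # imported here so normalize_box needs no display

    root = tk.Tk()
    photo = tk.PhotoImage(file=image_path)
    canvas = tk.Canvas(root, width=photo.width(), height=photo.height())
    canvas.pack()
    canvas.create_image(0, 0, image=photo, anchor="nw")

    boxes = []
    drag = {}  # start point and canvas item of the box being drawn

    def on_press(event):
        drag["start"] = (event.x, event.y)
        drag["item"] = canvas.create_rectangle(
            event.x, event.y, event.x, event.y, outline="red"
        )

    def on_move(event):
        x0, y0 = drag["start"]
        canvas.coords(drag["item"], x0, y0, event.x, event.y)

    def on_release(event):
        x0, y0 = drag["start"]
        box = normalize_box(x0, y0, event.x, event.y)
        boxes.append(box)
        print(box)  # paste these into the region config

    canvas.bind("<ButtonPress-1>", on_press)
    canvas.bind("<B1-Motion>", on_move)
    canvas.bind("<ButtonRelease-1>", on_release)
    root.mainloop()  # returns once the window is closed
    return boxes
```

Calling select_boxes("page-001.png") (a placeholder file name) would open the 
page, print each box as it is drawn, and return the full list when the window 
is closed.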

Also, I thought I could get the content of the PDF with pyPdf, using the 
PageObject's extractText() method, and then just remove the dynamic parts. It 
turns out this can't be done; it just isn't possible with PDFs. Too bad. If 
that had worked, I wouldn't need to worry about elements moving or about 
collecting coordinates for every PDF.

Any ideas will be appreciated.

PS: I'm thinking of just writing the coordinate generator tool myself, but I 
have zero experience with GUI programming, let alone Tkinter. 
-- 
http://mail.python.org/mailman/listinfo/python-list
