DrLeif <l.lensg...@gmail.com> writes:

> What I would like to do is have python detect a "blank" pages in a PDF
> file and remove it. Any suggestions?

The odds are good that even a blank page is being "rendered" within the PDF as having some small bits of data due to scanner resolution, imperfections on the page, etc. So I suspect you won't be able to just look for a well-defined pattern in the resulting PDF. Unless you're using OCR, the odds are good that the scanner is storing each page in the PDF as an embedded image.

What I'd probably do is extract the image of each page, and then use image processing on it to try to identify blank pages. I haven't had the need to do this myself, and tool availability would depend on platform, but for example, I'd probably try ImageMagick's convert operation to turn the PDF into images (like PNGs). I think GIMP can also do a similar conversion, but you'd probably have to script it yourself.

Once you have an image of a page, you could then use something like OpenCV to process it (perhaps a morphology operation to remove small noise areas, then a threshold or non-zero pixel count to judge "blankness"), or probably just something like PIL, depending on the complexity of the processing needed.

Once you identify a blank page, removing it could be done either in pure Python (there have been other posts recently about PDF libraries) or with external tools (such as pdftk under Linux, for example).

-- David
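
For what it's worth, here is a rough, untested-on-real-scans sketch of the "blankness" judgement described above. It is not the OpenCV morphology approach; it swaps in something simpler: count the pixels darker than a cutoff and call the page blank if they make up only a tiny fraction of the page, which tolerates a bit of scanner speckle. The function names and both thresholds (dark_below, max_dark_fraction) are my own assumptions and would need tuning against your scans. The last helper assumes Pillow is installed and that each page has already been exported as an image file (for example with ImageMagick).

```python
def dark_fraction(pixels, dark_below=200):
    """Return the fraction of grayscale pixels (0-255) darker than dark_below.

    `pixels` is any sized sequence of ints, e.g. the bytes of an 8-bit
    grayscale image. The cutoff of 200 is an assumed starting point.
    """
    total = len(pixels)
    if total == 0:
        return 0.0
    dark = sum(1 for p in pixels if p < dark_below)
    return dark / total


def is_blank(pixels, dark_below=200, max_dark_fraction=0.005):
    """Judge a page blank if at most max_dark_fraction of it is dark.

    The 0.5% tolerance is a guess meant to absorb scanner noise;
    it is not a value taken from any library or standard.
    """
    return dark_fraction(pixels, dark_below) <= max_dark_fraction


def page_image_is_blank(image_path, **kwargs):
    """Apply is_blank() to a page image on disk (hypothetical helper).

    Assumes Pillow is installed; the import is local so the two
    functions above stay usable without it.
    """
    from PIL import Image

    with Image.open(image_path) as img:
        # Mode "L" is 8-bit grayscale, so tobytes() yields one byte per pixel.
        pixels = img.convert("L").tobytes()
    return is_blank(pixels, **kwargs)
```

You would call page_image_is_blank() on each exported page image, collect the page numbers it flags, and hand that list to whatever PDF library or external tool you use to drop the pages.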