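Ghostscript is good for working with PDFs containing text; yours likely contain images but no text. Ghostscript re-renders each page at whatever resolution you request, so a scanned page image gets rescaled on the way out. Using a tool such as pdfimages to extract the raster images from the PDF will give you what you want, without any unwanted rescaling.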
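
Since you asked about driving this from Python, here is a minimal sketch of calling pdfimages through the subprocess module. It assumes the pdfimages binary (from xpdf or poppler-utils) is on your PATH; the function names, the output directory, and the "page" prefix are made up for illustration. With no format options, pdfimages writes PBM/PPM files named <root>-NNN.<ext>.

```python
import subprocess
from pathlib import Path


def build_pdfimages_command(pdf_path, output_root):
    """Return the argument list for: pdfimages PDF-file image-root.

    No format flags are passed, so pdfimages writes the embedded
    images as PBM/PPM files without resampling them.
    """
    return ["pdfimages", str(pdf_path), str(output_root)]


def extract_images(pdf_path, output_dir, prefix="page"):
    """Extract the embedded raster images of one PDF into output_dir.

    Hypothetical helper: 'prefix' is an arbitrary image-root name.
    Returns the sorted list of files that pdfimages produced.
    """
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    root = output_dir / prefix
    # check=True raises CalledProcessError if pdfimages exits non-zero.
    subprocess.run(build_pdfimages_command(pdf_path, root), check=True)
    # pdfimages names its output <root>-NNN.<ext>.
    return sorted(p for p in output_dir.iterdir()
                  if p.name.startswith(prefix + "-"))
```

Something like extract_images("scan.pdf", "extracted") should then leave one image file per embedded scan in the "extracted" directory, at the resolution stored in the PDF. Those files can be converted to TIFF afterwards with an imaging library if your OCR step needs that format.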

Kevin Carlson wrote, On 2010-09-24 12:37:
We receive PDF files which appear to contain scanning artifacts which
severely impact recognition. Specifically, under magnification you
can see regularly spaced "notches" and corresponding "bumps",
especially noticeable with vertical lines.

Currently I'm using Ghostscript to convert the files to TIFF for
processing, any Python-based alternatives out there? Ultimately would
like to do all cleaning and converting using Python, with "Pytesser"
to do the OCR.
