Re: Extracting text from scanned PDF docs

Eugene Reimer Fri, 24 Sep 2010 15:20:01 -0700

Ghostscript is good for working with PDFs containing text; yours likely have images but no text. Using something like pdfimages to extract the raster images from a PDF will give you what you want, without any unwanted rescaling.
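
Since you asked about Python: a rough sketch of driving pdfimages from a script, assuming the pdfimages command-line tool (from Xpdf or Poppler) is installed and on your PATH. The function names, the "page" prefix, and the output directory are just placeholders I made up for illustration. With no format options pdfimages writes PPM/PBM files named <root>-NNN, which you would still have to convert if your OCR step wants TIFF.

```python
import shutil
import subprocess
from pathlib import Path


def build_pdfimages_cmd(pdf_path, out_dir, prefix="page"):
    """Return the argv list for: pdfimages <pdf> <out_dir>/<prefix>"""
    # pdfimages takes the PDF file followed by an "image root" that it
    # uses as the filename prefix for every image it extracts.
    return ["pdfimages", str(pdf_path), str(Path(out_dir) / prefix)]


def extract_images(pdf_path, out_dir, prefix="page"):
    """Run pdfimages and return the files it wrote, sorted by name."""
    if shutil.which("pdfimages") is None:
        raise RuntimeError("pdfimages not found on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdfimages exits non-zero.
    subprocess.run(build_pdfimages_cmd(pdf_path, out_dir, prefix), check=True)
    # Extracted images are named <prefix>-NNN.<ext>.
    return sorted(out_dir.glob(prefix + "-*"))
```

Calling extract_images("scan.pdf", "out") should leave the embedded scans in out/ at their original resolution, ready for whatever cleanup you do before OCR.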


Kevin Carlson wrote, On 2010-09-24 12:37:

We receive PDF files which appear to contain scanning artifacts which
severely impact recognition. Specifically, under magnification you
can see regularly spaced "notches" and corresponding "bumps",
especially noticeable with vertical lines.

Currently I'm using Ghostscript to convert the files to TIFF for
processing. Are there any Python-based alternatives out there?
Ultimately I would like to do all cleaning and converting using
Python, with "Pytesser" to do the OCR.


--
You received this message because you are subscribed to the Google Groups 
"tesseract-ocr" group.
