On 31.03.2012 15:59, klo wrote:
> I have a scanned PDF to which I want to add a hidden text layer, so that
> I can index the document. I used the ghostscript black and white tiff output
> device (tiffg4) to extract pages as tiff images, and here is an example of
> what they look like:
>
> <http://i.imgur.com/5sZSl.png>
>
> Processing this image with tesseract does not give good results.
> Changing the ghostscript output DPI (600, 300, 150, 96) shows that the image
> at 96 DPI gives the best result from tesseract, but it's still not
> satisfactory.
>
> I then used 8-bit gray tiff output from ghostscript instead of 1-bit black
> and white, and in this case at 150 DPI I got an even better result than
> previously with 96 DPI black and white. However, it is still not there yet.
>
> Can someone suggest which filter could enhance this image so that I get
> better results? I could use imagemagick, but I can also use a general imaging
> filter from a programming language, so just name it if you know how.
>
> TIA

It is difficult to suggest the best strategy if you do not provide the input (PDF) and the exact command you used to run the conversion. There are several ways/tools to convert a PDF to an image, for example [1], [2].
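To illustrate the kind of exact command that makes such a report reproducible, here is a minimal sketch in Python that builds a Ghostscript invocation for the device and DPI combinations mentioned in the question. The file name `input.pdf`, the output pattern, and the helper names are hypothetical; the original poster's actual command is not known, so treat this as one possible setup rather than what was actually run.

```python
import subprocess


def ghostscript_command(pdf_path, out_pattern, dpi=150, device="tiffgray"):
    """Build a Ghostscript command that renders every PDF page to a TIFF.

    device: "tiffg4" for 1-bit black and white, "tiffgray" for 8-bit gray.
    out_pattern: output name with a page counter, e.g. "page-%03d.tif".
    """
    return [
        "gs",                          # assumes Ghostscript is on PATH as "gs"
        "-dNOPAUSE", "-dBATCH",        # run non-interactively, exit when done
        "-dSAFER",
        f"-sDEVICE={device}",
        f"-r{dpi}",                    # rendering resolution in DPI
        f"-sOutputFile={out_pattern}",
        pdf_path,
    ]


def render_pages(pdf_path, out_pattern, dpi=150, device="tiffgray"):
    """Run the conversion and return the command that was executed."""
    cmd = ghostscript_command(pdf_path, out_pattern, dpi, device)
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # Print the commands for the resolutions tried in the question, so the
    # exact invocation can be pasted into a follow-up message.
    for dpi in (96, 150, 300, 600):
        cmd = ghostscript_command("input.pdf", f"page-{dpi}dpi-%03d.tif", dpi)
        print(" ".join(cmd))
```

Each rendered page can then be passed to tesseract (for example `tesseract page-150dpi-001.tif page-150dpi-001`), which makes it easy to compare results across DPI and device settings on the same input.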
[1] http://virtualvoid.posterous.com/pdf-to-image-conversion-comparing-pdf-rendere
[2] http://stackoverflow.com/questions/75500/best-way-to-convert-pdf-files-to-tiff-files#221341