On Thu, Jun 26, 2008 at 05:03:23PM +0100, Peter von Kaehne wrote:
> 
> > Can anyone in the forum read Yiddish?
> 
> No, but I would probably understand it (so-so) if it were read to me. I
> could probably also make sense of it if it were transliterated into
> Latin script.
> 
> There is a problem with PDFs, though - I know of no way of scraping a
> non-ASCII PDF. Usually "copy" turns up garbage.

I've had modest success with PDF files that contain scanned images
of documents by using Ghostscript to render each page to a TIFF file and
then running that through the Tesseract OCR engine. It does a fairly
good job, but unfortunately the results still need quite a bit of
cleaning up. The worst problem is that Tesseract doesn't know about page
formatting; it just outputs text in whatever order it sees it. Just this
week I learned of a tool called "unpaper" which can do a lot of cleanup
on pages between the creation of the image and the OCR process, but I
haven't tried it yet.
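
For anyone who wants to try the same route, here is a rough sketch of
the Ghostscript-then-Tesseract steps driven from Python. It is only an
illustration under some assumptions: the function names, the 300 dpi
resolution, the "tiffgray" output device and the page-%03d.tif naming
are my own choices, not anything prescribed by either tool, and it
assumes the "gs" and "tesseract" binaries are on the PATH with the
language data you need installed. The unpaper step is left out since I
haven't tried it.

```python
import subprocess
from pathlib import Path


def ghostscript_command(pdf_path, out_pattern, dpi=300):
    """Build the Ghostscript call that renders each PDF page to a TIFF.

    The device and resolution are assumptions; adjust to taste.
    """
    return [
        "gs",
        "-dNOPAUSE",            # do not pause between pages
        "-dBATCH",              # exit when the input is finished
        "-dSAFER",              # restrict file access during rendering
        "-sDEVICE=tiffgray",    # 8-bit grayscale TIFF output
        f"-r{dpi}",             # rendering resolution in dpi
        f"-sOutputFile={out_pattern}",  # %03d is replaced by the page number
        str(pdf_path),
    ]


def tesseract_command(image_path, out_base, lang="eng"):
    """Build the Tesseract call; it writes its text to <out_base>.txt."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng"):
    """Render a scanned PDF to per-page TIFFs, OCR each one, join the text."""
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)

    pattern = work_dir / "page-%03d.tif"
    subprocess.run(ghostscript_command(pdf_path, pattern), check=True)

    texts = []
    for image in sorted(work_dir.glob("page-*.tif")):
        out_base = image.with_suffix("")  # e.g. work_dir/page-001
        subprocess.run(tesseract_command(image, out_base, lang), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))

    # Pages are joined in page order; the text order within a page is
    # whatever Tesseract produced, so manual cleanup is still needed.
    return "\n".join(texts)
```

Calling ocr_pdf("scan.pdf", "work", lang="eng") would leave the TIFFs
and per-page .txt files in the "work" directory; the file name and
language code here are placeholders, so substitute whatever matches
your document and your installed Tesseract language data.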

-- 
---- Fred Smith -- [EMAIL PROTECTED] -----------------------------
               But God demonstrates his own love for us in this: 
                         While we were still sinners, 
                              Christ died for us.
------------------------------- Romans 5:8 (niv) ------------------------------

_______________________________________________
sword-devel mailing list: sword-devel@crosswire.org
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page
