On Thu, Jun 26, 2008 at 05:03:23PM +0100, Peter von Kaehne wrote:
>
> > Can anyone in the forum read Yiddish?
>
> No, but I would probably understand (so-so) it if it was read to me. I
> probably also could make sense if it was latin transliterated.
>
> There is though a problem with PDFs - I know of no way of scraping a non
> ASCII PDF. usually "copy" turns up garbage.

I've had modest success with PDF files that contain scanned images of documents by using Ghostscript to print to a TIFF file and then processing that with the Tesseract OCR engine. It does a fairly good job, but unfortunately the results still need quite a bit of cleaning up. The worst problem is that Tesseract doesn't know about page formatting; it just outputs text in whatever order it sees it.

Just this week I learned of a tool called "unpaper", which can do a lot of cleanup on pages between the creation of the image and the OCR process, but I haven't tried it yet.

--
---- Fred Smith -- [EMAIL PROTECTED] -----------------------------
But God demonstrates his own love for us in this:
While we were still sinners, Christ died for us.
------------------------------- Romans 5:8 (niv) ------------------------------
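
For anyone who wants to try the Ghostscript-to-Tesseract route described above, here is a rough sketch in Python. It is not the exact set of commands used in the message: the helper names (`gs_command`, `tesseract_command`, `ocr_pdf`), the `tiffgray` device, the 300 dpi resolution, the `page-%03d.tif` naming, and the default language `eng` are all assumptions chosen for illustration. It also assumes the `gs` and `tesseract` executables are on the PATH and that a suitable Tesseract language pack is installed.

```python
import subprocess
from pathlib import Path


def gs_command(pdf_path, out_pattern, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a grayscale TIFF.

    The device and resolution are assumptions, not values from the original message.
    """
    return [
        "gs",
        "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=tiffgray",
        f"-r{dpi}",
        f"-sOutputFile={out_pattern}",  # %03d is expanded to the page number
        str(pdf_path),
    ]


def tesseract_command(image_path, out_base, lang="eng"):
    """Build a Tesseract command; Tesseract writes its text to out_base + '.txt'."""
    return ["tesseract", str(image_path), str(out_base), "-l", lang]


def ocr_pdf(pdf_path, work_dir, lang="eng", dpi=300):
    """Render a scanned PDF to one TIFF per page, OCR each page, return the text."""
    work = Path(work_dir)
    work.mkdir(parents=True, exist_ok=True)
    pattern = str(work / "page-%03d.tif")
    subprocess.run(gs_command(pdf_path, pattern, dpi), check=True)

    texts = []
    for image in sorted(work.glob("page-*.tif")):
        out_base = image.with_suffix("")  # page-001.tif -> page-001
        subprocess.run(tesseract_command(image, out_base, lang), check=True)
        texts.append(out_base.with_suffix(".txt").read_text(encoding="utf-8"))
    return "\n".join(texts)
```

A call such as `ocr_pdf("scan.pdf", "work")` (hypothetical file names) would return the concatenated page text. A cleanup tool like unpaper would slot in between the two `subprocess.run` steps, operating on the page images before they reach Tesseract. The sketch does nothing about reading order, so the output still needs the manual cleanup mentioned above.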
_______________________________________________ sword-devel mailing list: sword-devel@crosswire.org http://www.crosswire.org/mailman/listinfo/sword-devel Instructions to unsubscribe/change your settings at above page