Christopher Jones wrote:
>
> I have that tool. But some pdf or ps files consist not of coded text but a
> bitmapped image. For instance, pdf and ps files which I download from journal
> databases are scanned images of journal pages. ps2ascii and pdftotext will not
> extract text from these files, since there is no ascii content to extract.
>
> So my question is: is there any software out there which attempts to look at
> bitmaps and guess what the ascii would be-- something like those programs which
> read books through a scanner and try to match font characters to the image. And
> I say this question is a reach, because I know that those programs which I have
> heard about are either very expensive or very inaccurate.
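With
pdfimages -f 1 file.pdf DirForTheImages
you can extract all images in the PDF file, starting at page 1. The last
argument is used as the prefix of the output file names. By default the
images are saved in PPM or PBM format (a good choice for OCR); with the
option -j, images stored as JPEG in the PDF are saved as JPEGs instead.
With
pdftotext file.pdf file.txt
you can convert all the coded text to plain text.
If the PDF file contains scanned text that is stored as images, you can
convert the extracted images from PBM to TIFF and then run an OCR
program on them.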
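
The steps above could be strung together roughly like this. It is only a
sketch: pnmtotiff (from netpbm) and tesseract are my assumptions for the
PBM-to-TIFF converter and the OCR program, since no particular tools are
named above, and the function names tiff_name and ocr_pdf are made up
for illustration.

```shell
#!/bin/sh
# Sketch: extract the page images from a scanned PDF and OCR them.
# Assumes pdfimages, pnmtotiff (netpbm) and tesseract are installed;
# the last two are assumptions, not tools named in the post.

# Map an extracted image name to the TIFF name used as OCR input,
# e.g. page-000.pbm -> page-000.tif
tiff_name() {
    printf '%s\n' "${1%.*}.tif"
}

# ocr_pdf FILE.pdf PREFIX
# Writes PREFIX-NNN.pbm/.ppm, converts each one to TIFF, and runs OCR
# on each TIFF, producing PREFIX-NNN.txt.
ocr_pdf() {
    pdf=$1
    prefix=$2
    pdfimages -f 1 "$pdf" "$prefix" || return 1
    for img in "$prefix"-*.pbm "$prefix"-*.ppm; do
        [ -e "$img" ] || continue          # skip patterns that matched nothing
        tif=$(tiff_name "$img")
        pnmtotiff "$img" > "$tif" || return 1
        tesseract "$tif" "${tif%.tif}" || return 1   # writes ${tif%.tif}.txt
    done
}
```

After sourcing the file, something like "ocr_pdf file.pdf page" should
leave one text file per extracted image. How good the text is depends
entirely on the scan quality and the OCR program.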
Herbert
--
[EMAIL PROTECTED]
http://perce.de/lyx/