M.B. Schiekel wrote:
> I'm trying to establish a document server with htdig under SuSE-8.2. In
> this context I also tried to build an index of pdf-files, created with
> LyX, with the htdig external_parsers method.
> This works for all my pdf-files, except the ones from LyX-sources.
> ...
> BUGS
>        Some  PDF  files  contain  fonts whose encodings have been
>        mangled beyond recognition.  There is  no  way  (short  of
>        OCR) to extract text from these files.

After some additional research, I found that pdftotext really does have
problems recognizing some font encodings.
The fonts used in a file can be listed with pdffonts, which is also part
of the xpdf package.
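
To compare many files at once, the pdffonts listing can be read from a
script. The following is only a minimal sketch in Python: it assumes that
pdffonts is on the PATH and that its table has a header line, a dashed
rule, and then rows ending in the columns emb, sub, uni, object number and
generation (the layout may differ between xpdf versions). The names
parse_pdffonts and list_fonts are made up for this example.

```python
import subprocess


def parse_pdffonts(output):
    """Parse the pdffonts table into a list of dicts.

    Assumed layout (may vary between xpdf versions): one header line,
    one dashed rule, then rows of
    name, type, emb, sub, uni, object number, generation.
    """
    rows = []
    for line in output.splitlines()[2:]:  # skip header and dashed rule
        # The font type may contain spaces ("Type 1"), the last five
        # columns do not, so split those off from the right.
        fields = line.rsplit(None, 5)
        if len(fields) != 6:
            continue  # blank or unexpected line
        head, emb, sub, uni, _objnum, _objgen = fields
        parts = head.split(None, 1)
        name = parts[0]
        font_type = parts[1].strip() if len(parts) > 1 else ""
        rows.append({"name": name, "type": font_type,
                     "emb": emb, "sub": sub, "uni": uni})
    return rows


def list_fonts(pdf_path):
    """Run pdffonts on one file and return the parsed rows."""
    result = subprocess.run(["pdffonts", pdf_path],
                            capture_output=True, text=True, check=True)
    return parse_pdffonts(result.stdout)
```

Calling list_fonts("file.pdf") on one pdf-file from each tool chain and
comparing the type and uni columns should show where the files differ.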

Version 2.0.1-49 of pdftotext cannot handle pdf-files that were
generated with dvi2ps->ps2pdf.
It can handle pdf-files that were generated with dvipdfm, but some of
the spaces are lost (soyougetsomeverylongwords - not so good for indexing).
The best result (for indexing) comes from pdf-files that were
generated by pdflatex.
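
One rough way to spot the missing-spaces problem before indexing is to
look for unusually long words in the pdftotext output. This is only a
heuristic sketch in Python. The function names and the 20-character
threshold are assumptions made for this example, and the threshold would
need to be raised for languages with long compound words, such as German.

```python
import re

# Runs of letters only; digits, punctuation and underscores end a word.
WORD_RE = re.compile(r"[^\W\d_]+")


def suspicious_words(text, max_len=20):
    """Return the words longer than max_len characters (arbitrary cutoff)."""
    return [w for w in WORD_RE.findall(text) if len(w) > max_len]


def run_together_ratio(text, max_len=20):
    """Fraction of words that are longer than max_len characters."""
    words = WORD_RE.findall(text)
    if not words:
        return 0.0
    return sum(len(w) > max_len for w in words) / len(words)
```

A file whose extracted text gives a clearly higher ratio than the same
document built with pdflatex is a candidate for regeneration.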

Thank you
bernhard

-- 
http://home.t-online.de/home/mb.schiekel/
GPG-Key available: GnuPG-1.2.2
