M.B. Schiekel wrote:
> I'm trying to establish a document server with htdig under SuSE-8.2. In
> this context I also tried to build an index of pdf-files, created with
> LyX, with the htdig external_parsers method.
> This works for all my pdf-files, except the ones from LyX-sources.
> ...
> BUGS
> Some PDF files contain fonts whose encodings have been
> mangled beyond recognition. There is no way (short of
> OCR) to extract text from these files.

After some additional research, I found that pdftotext really does have problems recognizing some font encodings. The fonts used in a file can be listed with pdffonts, which is also part of the xpdf package.

Version 2.0.1-49 of pdftotext cannot handle pdf-files that were generated with dvi2ps->ps2pdf. It can handle pdf-files that were generated with dvipdfm, but some of the spaces are lost (soyougetsomeverylongwords - not so good for indexing). The best result (for indexing) is produced from pdf-files that were generated by pdflatex.

Thank you
bernhard

--
http://home.t-online.de/home/mb.schiekel/
GPG-Key available: GnuPG-1.2.2
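
P.S. To check a batch of pdf-files before indexing, something like the following rough Python sketch might help. It is not part of htdig or xpdf. It assumes pdftotext is on the PATH and is called as `pdftotext file.pdf -` to write the text to stdout. The helper names and the 20-letter threshold are my own assumptions for illustration, and the threshold is only a heuristic, since long compound words (e.g. in German) will also be flagged.

```python
import re
import shutil
import subprocess
import sys


def extract_text(pdf_path):
    """Return the text pdftotext extracts, or None if pdftotext is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    # "-" as the output file makes pdftotext write to stdout.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def suspicious_words(text, max_len=20):
    """Return runs of letters longer than max_len (assumed threshold).

    Many such words suggest that spaces were lost during extraction.
    """
    words = re.findall(r"[^\W\d_]+", text)
    return [w for w in words if len(w) > max_len]


if __name__ == "__main__":
    # Usage: python check_pdf.py file1.pdf file2.pdf ...
    for path in sys.argv[1:]:
        text = extract_text(path)
        if text is None:
            print("pdftotext not found on PATH")
            break
        if not text.strip():
            print(f"{path}: no text extracted (font encoding problem?)")
            continue
        hits = suspicious_words(text)
        print(f"{path}: {len(hits)} suspiciously long words")
```

A file that yields no text at all, or many over-long words, is probably better regenerated with pdflatex before it goes into the index.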