Vyz wrote:
> I am looking for a PDF to text script. I am working with multibyte
> language PDFs on Windows Xp. I need to batch convert them to text and
> feed into an encoding converter program
>
> Thanks for any help in this regard

Multibyte languages are not easy. I do text extraction from PDFs, but 1) I do it on Linux and 2) I only need English text. The utility I use is pdftotext, which comes as part of the Xpdf *nix package.

The other problem, however, is not the extraction itself but the fact that the extracted text might not look very good. In other words, the extraction program will never complain, yet it may still produce garbage, and then you have to process the result yourself. For example, whitespace is not consistent: sometimes there is extra whitespace and sometimes there is not enough, for example " S o m e w ordsloo l i k e t his" and so on...

The real answer is that PDF text extraction is pretty hard. It is 1000x better to get hold of the original source...

Nick V.
--
http://mail.python.org/mailman/listinfo/python-list
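
For what it's worth, the batch step described above could be sketched roughly as follows. This is only a minimal sketch, not a tested recipe for multibyte PDFs: it assumes a pdftotext binary is installed and on the PATH, that its -enc UTF-8 output is what the encoding converter wants, and the function names (build_command, normalize_whitespace, convert_directory) are made up for illustration. The cleanup only collapses extra whitespace; it cannot restore spaces that the extraction dropped.

```python
import re
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, encoding="UTF-8"):
    """Return the pdftotext argument list for one file (assumes pdftotext is on PATH)."""
    return ["pdftotext", "-enc", encoding, str(pdf_path), str(txt_path)]


def normalize_whitespace(text):
    """Collapse runs of spaces/tabs to one space and trim each line.

    Note: str.splitlines() also treats form feeds (pdftotext's page breaks)
    as line boundaries, so page breaks become plain newlines here.
    """
    lines = (re.sub(r"[ \t]+", " ", line).strip() for line in text.splitlines())
    return "\n".join(lines)


def convert_directory(src_dir, dst_dir):
    """Run pdftotext on every *.pdf in src_dir and write cleaned text to dst_dir."""
    dst = Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    for pdf in sorted(Path(src_dir).glob("*.pdf")):
        txt = dst / (pdf.stem + ".txt")
        # check=True raises CalledProcessError if pdftotext exits with an error
        subprocess.run(build_command(pdf, txt), check=True)
        raw = txt.read_text(encoding="utf-8", errors="replace")
        txt.write_text(normalize_whitespace(raw), encoding="utf-8")
```

Calling convert_directory("pdfs", "text") would then process a whole folder in one go (both folder names are placeholders). Even when every file converts without an error, the output still needs to be checked by eye, for the reasons given above.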