Re: PDF to text script

Nick Vatamaniuc Fri, 10 Nov 2006 19:30:53 -0800

Vyz wrote:
> I am looking for a PDF to text script. I am working with multibyte
> language PDFs on Windows XP. I need to batch convert them to text and
> feed them into an encoding converter program.
>
> Thanks for any help in this regard


Multibyte languages are not easy.  I do text extraction from PDFs, but
1) I do it on Linux and 2) I only need English text. The utility I use
is pdftotext, which comes as part of the Xpdf package for *nix.
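
For the batch part, here is a rough sketch of how pdftotext could be
driven from Python. It assumes pdftotext is installed and on the PATH,
and that Python 3.5 or later is available for subprocess.run. The
function names, the directory arguments and the UTF-8 default are my
own choices for illustration, and I have not tried this on multibyte
PDFs, so check the output before feeding it to the converter.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, encoding="UTF-8"):
    """Return the pdftotext argument list for one file.

    -enc selects the encoding of the text output. UTF-8 is an
    assumption here; pick whatever the downstream converter expects.
    """
    return ["pdftotext", "-enc", encoding, str(pdf_path), str(txt_path)]


def convert_directory(src_dir, dst_dir, encoding="UTF-8"):
    """Convert every *.pdf in src_dir, writing .txt files to dst_dir.

    Returns the list of PDFs for which pdftotext exited with an error.
    A zero exit code only means the tool ran, not that the text is good.
    """
    src_dir, dst_dir = Path(src_dir), Path(dst_dir)
    dst_dir.mkdir(parents=True, exist_ok=True)
    failed = []
    for pdf_path in sorted(src_dir.glob("*.pdf")):
        txt_path = dst_dir / (pdf_path.stem + ".txt")
        result = subprocess.run(build_command(pdf_path, txt_path, encoding))
        if result.returncode != 0:
            failed.append(pdf_path)
    return failed
```

Calling convert_directory("pdfs", "text") would then process a whole
folder in one go (both folder names are placeholders).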

The other problem, however, is not with the extraction itself but with
the fact that the extracted text might not look very good.  In other
words, the extraction program will never complain but can nevertheless
produce garbage.  Then you have to clean up the result yourself. For
example, whitespace is not consistent: sometimes there is extra
whitespace and sometimes there is not enough, for example " S o m  e
  w ordsloo l i k e t his" and so on...
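
Since the extractor will not warn you, some post-processing of your own
is needed. The sketch below shows one possible approach: collapse runs
of whitespace, and flag lines where most tokens are single characters
so they can be reviewed by hand. The function names and the 0.5
threshold are arbitrary choices of mine, and the heuristic assumes
space-separated text like English. It will not repair words that were
run together, and it is unlikely to suit multibyte scripts as is.

```python
import re


def normalize_whitespace(line):
    """Collapse any run of whitespace to one space and trim the ends."""
    return re.sub(r"\s+", " ", line).strip()


def looks_garbled(line, threshold=0.5):
    """Guess whether a line was split into stray single characters.

    Returns True when at least `threshold` of the whitespace-separated
    tokens are one character long. Very short lines are never flagged,
    since there is too little evidence either way.
    """
    tokens = line.split()
    if len(tokens) < 4:
        return False
    single = sum(1 for token in tokens if len(token) == 1)
    return single / len(tokens) >= threshold
```

Lines that get flagged still have to be fixed manually or thrown away;
the check only tells you where to look.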

The real answer is that PDF text extraction is pretty hard. It is
1000x better to get hold of the original source...

Nick V.

-- 
http://mail.python.org/mailman/listinfo/python-list
