You may want to try out pdfminer. It's very similar to xpdf in structure and should give you the parsed text as Unicode directly.
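A rough sketch of what that could look like, assuming the maintained pdfminer.six fork and its `extract_text` helper in `pdfminer.high_level` (the original pdfminer package was normally driven through its `pdf2txt.py` script instead). The names `pdf_to_unicode`, `devanagari_ratio` and `convert_all` are made up for illustration. Since the fonts are embedded subsets, the result may still be garbage if their Unicode maps are wrong, so the sketch also reports how much of the output actually falls in the Devanagari block:

```python
def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block (U+0900..U+097F)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def pdf_to_unicode(path):
    """Extract the text of one PDF as a Unicode string (assumes pdfminer.six)."""
    # Imported lazily so devanagari_ratio() works without pdfminer installed.
    from pdfminer.high_level import extract_text
    return extract_text(path)


def convert_all(paths):
    """Write <name>.txt (UTF-8) next to each PDF; return {path: devanagari_ratio}."""
    report = {}
    for path in paths:
        text = pdf_to_unicode(path)
        with open(path + ".txt", "w", encoding="utf-8") as out:
            out.write(text)
        report[path] = devanagari_ratio(text)
    return report
```

Calling `convert_all` with the list of your 45 file names would then give one ratio per file. A ratio near zero for a PDF you know is in Hindi suggests the font's Unicode mapping is unusable, and that file would need a different approach.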

On Mon, May 24, 2010 at 7:13 PM, Eknath Venkataramani <eknath.i...@gmail.com> wrote:
> I have around 45 pdfs to convert into raw text containing text in _HINDI_.
> When I use the xpdf package, the generated text is very weird, so I'd like
> to write a program which would convert the pdf text into Unicode text as it
> is.
>
> The fonts used in the pdfs:
> name                                 type              emb sub uni object ID
> ------------------------------------ ----------------- --- --- --- ---------
> APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
> APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
> APKBGF+Usha                          Type 1C           yes yes yes     41  0
> APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
> APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0
>
> For eg. in the pdf: आदमी मुसाफिर है
> when I use pdftotext, I get some very weird symbols: '... .......'
> while i'd like 'आदमी मुसाफिर है' to be the output
>
> --
> Eknath Venkataramani
> _______________________________________________
> BangPypers mailing list
> bangpyp...@python.org
> http://mail.python.org/mailman/listinfo/bangpypers

--
--------------------------------------------------------
blog: http://blog.dhananjaynene.com
twitter: http://twitter.com/dnene