I have around 45 PDFs containing text in _HINDI_ that I need to convert into raw text. When I use the xpdf package, the generated text is garbled, so I'd like to write a program that converts the PDF text into Unicode text as it is.
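A minimal sketch of the kind of batch program meant here, under some assumptions: it shells out to `pdftotext` with `-enc UTF-8` to request Unicode output, then checks how much of the result falls in the Devanagari block (U+0900 to U+097F). The function names, the directory arguments and the 0.5 threshold are hypothetical choices for illustration. If the Unicode map embedded in a font does not point at Devanagari code points, this will only flag the file; a font-specific mapping table would still be needed and is not shown.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, txt_path):
    """Build the pdftotext command line; -enc UTF-8 requests UTF-8 output."""
    return ["pdftotext", "-enc", "UTF-8", str(pdf_path), str(txt_path)]


def devanagari_ratio(text):
    """Fraction of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def convert_all(pdf_dir, out_dir, min_ratio=0.5):
    """Convert every PDF in pdf_dir and return the names of files whose
    extracted text does not look like Hindi (threshold is an assumption)."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    suspicious = []
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        txt_path = out_dir / (pdf_path.stem + ".txt")
        # Requires the pdftotext executable to be on PATH.
        subprocess.run(pdftotext_command(pdf_path, txt_path), check=True)
        text = txt_path.read_text(encoding="utf-8", errors="replace")
        if devanagari_ratio(text) < min_ratio:
            suspicious.append(pdf_path.name)
    return suspicious
```

Calling `convert_all("pdfs", "text")` would write one `.txt` file per PDF and return the list of files whose output still needs attention; both directory names are placeholders.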
The fonts used in the PDFs:

name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
APKBGF+Usha                          Type 1C           yes yes yes     41  0
APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0

For example, the PDF contains:

आदमी मुसाफिर है

When I use pdftotext, I get some very weird symbols: '... .......', while I'd like 'आदमी मुसाफिर है' to be the output.

--
Eknath Venkataramani
_______________________________________________
BangPypers mailing list
BangPypers@python.org
http://mail.python.org/mailman/listinfo/bangpypers