[BangPypers] extracting unicode text from pdfs

Eknath Venkataramani Mon, 24 May 2010 06:43:39 -0700

I have around 45 PDFs containing text in _HINDI_ that I need to convert into
raw text. When I use the xpdf package, the generated text is garbled, so I'd
like to write a program that converts the PDF text into Unicode text as it
is.


The fonts used in the pdfs:
name                                 type              emb sub uni object ID
------------------------------------ ----------------- --- --- --- ---------
APKAPP+Usha-Bold                     Type 1C           yes yes yes     72  0
APKBBB+Agenda-Light                  Type 1C           yes yes yes     77  0
APKBGF+Usha                          Type 1C           yes yes yes     41  0
APKBKJ+Agenda-Medium                 Type 1C           yes yes yes     46  0
APKBON+Agenda-Bold                   Type 1C           yes yes yes     49  0

For example, the PDF contains: आदमी मुसाफिर है
When I use pdftotext, I get garbled symbols instead: '... .......'
whereas I'd like the output to be 'आदमी मुसाफिर है'.
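
To make "garbled" checkable across all the files, here is a minimal sketch
(Python 3, standard library only) that measures how much of an extracted
string falls in the Unicode Devanagari block (U+0900 to U+097F). The function
names and the 0.5 threshold are my own assumptions, not part of xpdf or any
other tool, and the check says nothing about why the extraction fails.

```python
def devanagari_ratio(text):
    """Return the fraction of non-whitespace characters in the Devanagari block."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # U+0900..U+097F is the Unicode Devanagari block used for Hindi.
    hits = sum(1 for c in chars if "\u0900" <= c <= "\u097f")
    return hits / len(chars)


def looks_like_hindi(text, threshold=0.5):
    """Hypothetical helper: True if at least `threshold` of the text is Devanagari."""
    return devanagari_ratio(text) >= threshold


if __name__ == "__main__":
    print(looks_like_hindi("आदमी मुसाफिर है"))
```

Running this over the text extracted from each PDF would flag the files whose
output is not real Devanagari text.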


-- 
Eknath Venkataramani
_______________________________________________
BangPypers mailing list
[email protected]
http://mail.python.org/mailman/listinfo/bangpypers
