Hi Raj,

Your reply makes me believe I can finally move ahead with the project. Here is the code I am using. The PDF I am testing with is http://pib.nic.in/archieve/railbudget/rbudget2010/RBspeechHin.pdf. How do I find out the encoding used in the file?
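If I understand your explanation correctly, the text that comes out of the PDF is a series of font-specific codes, and getting readable Hindi means translating each code back to a Unicode character through a table built for that one font. A minimal sketch of that idea follows. The table `LEGACY_TO_UNICODE` and the helper `legacy_to_unicode` are hypothetical names, and the code-to-character pairs are made up for illustration; they are not read from this PDF or from any real font.

```python
# -*- coding: utf-8 -*-

# Hypothetical font-specific table. The keys stand for the codes a text
# extractor might return for a legacy (non-Unicode) Hindi font; the values
# are the Devanagari characters those codes are drawn as.
# These pairs are invented for illustration only.
LEGACY_TO_UNICODE = {
    u"d": u"\u0915",  # DEVANAGARI LETTER KA
    u"e": u"\u092e",  # DEVANAGARI LETTER MA
    u"y": u"\u0932",  # DEVANAGARI LETTER LA
}


def legacy_to_unicode(extracted, table=LEGACY_TO_UNICODE):
    """Translate font-specific codes to Unicode one code at a time.

    Codes that are missing from the table are passed through unchanged.
    """
    return u"".join(table.get(ch, ch) for ch in extracted)


# With the made-up table above, the extracted codes "dey" become the
# three letters KA, MA, LA.
converted = legacy_to_unicode(u"dey")
```

A table for a real font would probably also need multi-character codes and some reordering of vowel signs, which this sketch ignores. It only shows why the mapping has to be known per font.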
My code is:

    import pyPdf
    from BeautifulSoup import BeautifulSoup

    pdf = pyPdf.PdfFileReader(open("RBspeechHin.pdf", "rb"))
    # for page in pdf.pages:
    c = pdf.getPage(1).extractText()    # index 1 is the second page
    soup = BeautifulSoup(c)
    print soup.originalEncoding         # show what BeautifulSoup detected
    print soup.prettify()
    f = open('conv.txt', 'w')
    f.write(soup.prettify())            # write the text, not the soup object
    f.close()

I was working on an HTML program before this and used BeautifulSoup to detect the encoding, so I tried my luck here too, and it didn't work :(
If you can help me with just the PDF mentioned above, that will suffice. I will try learning from that :)

Cheers,
Aaditya

On Wed, Jun 2, 2010 at 3:28 PM, Amal <raj.a...@gmail.com> wrote:

> Hi Aaditya,
> Actually, reading Hindi text is not as simple as reading English text.
> Most Hindi PDFs don't use a standard encoding.
>
> An encoding is the value given to each Unicode code point,
> and each encoding corresponds to a font representation.
> So a PDF takes the encoded value, maps it to a font using a font map,
> and then renders the glyph. It does not know what character it is.
> So to read most Hindi PDFs, we have to know the encoding-to-character
> mapping.
>
> I worked with Dainik Bhaskar and other Hindi newspaper PDFs at my
> previous company and faced the same problem.
> So a generic Hindi PDF-to-text converter is not possible.
>
> But if you know a specific encoding, then you might be able to write a
> specific Hindi PDF-to-text converter.
>
> Amal.
>
> On Wed, Jun 2, 2010 at 2:50 AM, Srinivas Reddy Thatiparthy
> <srinivas_thatipar...@akebonosoft.com> wrote:
>
> > Hindi is Unicode text; your input data should be treated as Unicode
> > instead of ASCII, and last but not least, the encoding in the editor
> > should be set to Unicode, otherwise you see garbled text.
> >
> > This is my guess; I have never worked with Unicode in Python. If I am
> > wrong, please correct me.
> >
> > Thanks & Regards,
> > Srinivas Reddy Thatiparthy,
> > Mobile:9393099772,
> >
> > -----Original Message-----
> > From: bangpypers-bounces+srinivas_thatiparthy=akebonosoft....@python.org
> > on behalf of AADITYA SRIRAM
> > Sent: Wed 6/2/2010 2:22 PM
> > To: bangpypers@python.org
> > Subject: [BangPypers] PyPDF to read hindi
> >
> > Hi guys, I am writing a small program to convert PDF to text files (I
> > know it's easy and lame, but I need to start somewhere!!). Anyway, I am
> > not able to rip the Hindi text in readable form :( Can anyone please
> > help? It's working fine with English text.

_______________________________________________
BangPypers mailing list
BangPypers@python.org
http://mail.python.org/mailman/listinfo/bangpypers