Hi, I suggest you have a look at Apache TIKA: http://tika.apache.org
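
To give an idea of how little glue code this needs, here is a minimal sketch of running Tika from Python and checking the extracted text for a phrase. It assumes `java` is on the PATH and that the Tika application jar has been downloaded locally; the path `tika-app.jar` and the helper names `build_tika_command`, `extract_text` and `contains` are placeholders of mine, not part of Tika. It uses the standard library's subprocess module, and the output encoding may differ by platform, so decoding errors are replaced rather than raised.

```python
import subprocess

# Placeholder path: point this at wherever you saved the Tika application jar.
TIKA_JAR = "tika-app.jar"


def build_tika_command(document, jar=TIKA_JAR):
    """Return the argument list that asks the Tika app for plain-text output."""
    # --text tells the Tika command-line app to print extracted plain text to stdout.
    return ["java", "-jar", str(jar), "--text", str(document)]


def extract_text(document, jar=TIKA_JAR):
    """Run Tika on one file and return the extracted text as a string."""
    result = subprocess.run(
        build_tika_command(document, jar),
        capture_output=True,
        check=True,  # raise CalledProcessError if java/Tika exits non-zero
    )
    # Assumption: output is UTF-8 compatible; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")


def contains(text, phrase):
    """Case-insensitive substring check on the extracted text."""
    return phrase.lower() in text.lower()
```

Something like `contains(extract_text("upload.pdf"), "search term")` would then tell you whether an uploaded file mentions the term. Starting a JVM per file is slow, so for many documents you would probably extract the text once and store or index it.
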
You can easily call a "java -jar tika.jar" command via Python tools like os.popen and convert files in various formats to text. There's even a Python wrapper based on JCC, but I'm not sure if it's still maintained: http://redmine.djity.net/projects/pythontika/wiki

Regards,
Thomas

--
On 11.06.2013 at 12:05, Vishrut Mehta <vishrut.mehta...@gmail.com> wrote:

> Hello Everyone,
> I am Vishrut Mehta, currently a third-year student at IIIT
> Hyderabad, India. I have been contributing to Open Source for two years
> and have also contributed to organizations like E-cidadania, Sahana
> Software Foundation, Gnome, etc. I am very interested in search engines and
> search-related libraries.
>
> I need some help from the community. I am currently working
> on a project which deals with the following issue: I need to search within any
> documents uploaded by the user (like .pdf, .doc, etc.) and need to
> search for text or strings within those documents. Can anyone help me with this?
> It would be a great help!
>
> Thank you!
> Regards,
> --
>
> *Vishrut Mehta*
> International Institute of Information Technology,
> Gachibowli, Hyderabad-500032