Hi,
I suggest you have a look at Apache Tika: http://tika.apache.org
You can easily run a "java -jar tika.jar" command from Python with tools like
os.popen and convert files in various formats to text.
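A rough sketch of what that call could look like, assuming the Tika app jar is saved as tika.jar in the working directory and java is on your PATH. It uses subprocess instead of os.popen, and the helper names are made up for illustration:

```python
import subprocess

# Assumption: the Tika app jar sits in the working directory under this name.
TIKA_JAR = "tika.jar"


def build_tika_command(doc_path, jar_path=TIKA_JAR):
    """Build the argument list for a plain-text extraction run (hypothetical helper)."""
    # --text asks the Tika app to print the plain-text content to stdout.
    return ["java", "-jar", jar_path, "--text", doc_path]


def extract_text(doc_path, jar_path=TIKA_JAR):
    """Run Tika on one document and return the extracted text (hypothetical helper)."""
    result = subprocess.run(
        build_tika_command(doc_path, jar_path),
        capture_output=True,  # collect stdout and stderr instead of printing them
        check=True,           # raise CalledProcessError if java exits non-zero
    )
    return result.stdout.decode("utf-8", errors="replace")
```

Calling extract_text("report.pdf") should then give you a string you can hand to whatever indexer you use. Starting a JVM per file is slow, so this is probably only reasonable for small batches.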
There's even a Python wrapper based on JCC, but I'm not sure if that's still
maintained:
http://red
Hello sir,
Thank you for the quick reply. I want to integrate this functionality with
web2py, so I would need to stick with Python and PyLucene. So the method
you are suggesting is to extract the text from all the documents using
different Python libraries, then index the data, and then search the