On 28/01/2014 9:45 PM, kevinglove...@gmail.com wrote:
> I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I
> want to use Python and the SAX module (running under Windows 7) to
> carry out off-line phrase-searches of Wikipedia and to return a count
> of the number of hits for each search. Typical phrase-searches might
> be "of the dog" and "dog's".
>
> I have some limited prior programming experience (from many years ago)
> and I am currently learning Python from a course of YouTube tutorials.
> Before I get much further, I wanted to ask:
>
> Is what I am trying to do actually feasible?

Rather than parsing all 40+ GB every time you run a search, you should get better performance from an XML database, which lets you run queries directly against the XML data.

http://basex.org/ is one such database, and it comes with a Python API:

http://docs.basex.org/wiki/Clients
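
For comparison, the single-pass SAX approach from the question is feasible, and could look roughly like the sketch below. It streams the file, so memory use stays small, but every search re-reads the whole dump, which is the cost an XML database avoids. This is only a sketch under some assumptions: the names PhraseCounter and count_phrases are made up for illustration, the article body is assumed to sit in <text> elements, matching is a plain case-insensitive substring count, and wiki markup is not stripped.

```python
import io
import xml.sax


class PhraseCounter(xml.sax.ContentHandler):
    """Count case-insensitive phrase hits inside <text> elements.

    Hypothetical helper: the element name "text" is an assumption
    about the layout of the dump, not something taken from the thread.
    """

    def __init__(self, phrases):
        super().__init__()
        self.phrases = [p.lower() for p in phrases]
        self.counts = {p: 0 for p in self.phrases}
        self._in_text = False
        self._chunks = []

    def startElement(self, name, attrs):
        if name == "text":
            self._in_text = True
            self._chunks = []

    def characters(self, content):
        # SAX may deliver one element's text in several pieces,
        # so collect them and search only when the element ends.
        if self._in_text:
            self._chunks.append(content)

    def endElement(self, name):
        if name == "text":
            body = "".join(self._chunks).lower()
            for phrase in self.phrases:
                # str.count counts non-overlapping substring matches.
                self.counts[phrase] += body.count(phrase)
            self._in_text = False
            self._chunks = []


def count_phrases(source, phrases):
    """Stream `source` (a filename or binary file object) once."""
    handler = PhraseCounter(phrases)
    xml.sax.parse(source, handler)
    return handler.counts


if __name__ == "__main__":
    # Tiny in-memory stand-in for the real dump file.
    sample = (
        b"<mediawiki><page><title>Dog</title><revision>"
        b"<text>The tail of the dog. The dog's bowl.</text>"
        b"</revision></page></mediawiki>"
    )
    print(count_phrases(io.BytesIO(sample), ["of the dog", "dog's"]))
```

For the real dump you would pass the path of the unzipped file in place of the BytesIO object. Passing all phrases in one call matters here, because each call is a full pass over the 40+ GB.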

--
https://mail.python.org/mailman/listinfo/python-list
