On 28/01/2014 9:45 PM, kevinglove...@gmail.com wrote:
I have downloaded and unzipped the xml dump of Wikipedia (40+GB). I want to use Python and the SAX
module (running under Windows 7) to carry out off-line phrase-searches of Wikipedia and to return a
count of the number of hits for each search. Typical phrase-searches might be "of the
dog" and "dog's".
I have some limited prior programming experience (from many years ago) and I am
currently learning Python from a course of YouTube tutorials. Before I get much
further, I wanted to ask:
Is what I am trying to do actually feasible?
Rather than parsing through 40GB+ every time you need to do a search,
you should get better performance using an XML database which will allow
you to do queries directly on the xml data.
http://basex.org/ is one such db, and comes with a Python API:
http://docs.basex.org/wiki/Clients
--
https://mail.python.org/mailman/listinfo/python-list