On Tue, Aug 11, 2009 at 8:53 PM, David C Ullrich <dullr...@sprynet.com> wrote:
> Try reading a little there! Starting there I went to
>
> http://en.wikipedia.org/wiki/Wikipedia:Creating_a_bot
>
> where I found a section on existing bots, comments on how the "scraping"
> is not what you want, and even a Python section with a link to something
> labelled PyWikipediaBot...

Some information on using the PyWikipediaBot for scraping, from someone who used to program on the bot (and occasionally still does):

To make the framework work, you need to add a file user-config.py with the following contents:

    family = 'wikipedia'
    mylang = 'en'

If you also want to use the bot to edit pages on Wikipedia, you will have to add:

    usernames['wikipedia']['en'] = <the username of your bot>

If you work on another language, you of course use that language's abbreviation instead of 'en'.

The heart of the framework is the file wikipedia.py; that is the one you need to import. It contains two important classes, Page and Site, which represent a Wikipedia page and the site as a whole, respectively.

It is best to put your code in a try/finally block like this:

    try:
        mysite = wikipedia.getSite()
        <your code here>
    finally:
        wikipedia.stopme()

The stopme() call is part of the bot's throttling, which keeps it from over-feeding the server with requests. The bot waits a certain time (default is 10 seconds) between two requests, and if several bots are running, it lengthens this time. stopme() tells the framework that this bot is no longer running, so other runs are not delayed by it.

wikipedia.getSite() returns the Site object for your default site (with the settings above, that is the English-language Wikipedia).

Still with me? Good, because now we get to the real programming.

The Page class has as its __init__:

    def __init__(self, site, title, insite=None, defaultNamespace=0):

Here 'site' is the wiki on which the page exists (usually this will be mysite, which is why I defined it above) and 'title' is the title of the page. The optional parameters are for special usage.

The Page class has a number of methods, which you can find in the file, but some of the most important are:

    page.title()       - the title of the page
    page.site()        - the wiki the page is on
    page.get()         - the (wiki) text of the page
    page.put(text)     - saves the page with 'text' as its new content.
                         An important optional parameter is 'comment',
                         which specifies the edit summary given with the
                         change
    page.exists()      - a boolean, True if the page exists, False
                         otherwise
    page.linkedPages() - a list of Page objects: the pages this page
                         links to

However, instead of page.get() it is advisable to use:

    wikipedia.getall(site, pages)

with 'site' being a Site object (e.g. mysite) and 'pages' a list (or, more generally, an iterable) of Page objects. It fetches all the pages in the list with a single call to the wiki, which speeds up your bot and at the same time reduces its load on the wiki. Once a page has been loaded (either through get or through getall), subsequent calls to page.get() will not reload it. Thus the normal way of working is to build a list of the pages you are interested in, use getall (in groups of 60 or so) to load them, and then use get to work with them, as in the sketch below.
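For example, a minimal sketch of that batched workflow could look like the following. This is only an illustration, not code from the framework itself: the three page titles, the group size loop and the print line are invented here; the calls (wikipedia.Page, wikipedia.getall, page.exists, page.get) are the ones described above.

    import wikipedia

    try:
        mysite = wikipedia.getSite()

        # Build the list of pages we are interested in
        # (the titles are only examples).
        titles = ['Python (programming language)', 'Guido van Rossum',
                  'Monty Python']
        pages = [wikipedia.Page(mysite, title) for title in titles]

        # Load them in groups of 60, one request to the wiki per group.
        for i in range(0, len(pages), 60):
            wikipedia.getall(mysite, pages[i:i + 60])

        # From here on, page.get() returns the already loaded text
        # instead of contacting the wiki again.
        for page in pages:
            if page.exists():
                print page.title(), len(page.get())
    finally:
        wikipedia.stopme()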
Another useful file in the framework is pagegenerators. It provides a number of generators that yield Page objects. Some interesting ones (check the code for the exact parameters):

    AllpagesPageGenerator    - generates all pages of the wiki,
                               alphabetically from a specified starting
                               point
    ReferringPageGenerator   - all pages linking to a given page
    CategorizedPageGenerator - all pages in a given category
    LinkedPageGenerator      - all pages linked to from a given page

Other generators are used by 'wrapping' them around a given generator. The most important of these is the PreloadingGenerator, which ensures that the pages are preloaded (using wikipedia.getall) in groups.

A simple way to use the bot framework to scrape all pages of the English Wikipedia (warning: this takes a few days!) would be:

    import wikipedia
    import pagegenerators

    basicgen = pagegenerators.AllpagesPageGenerator(includeredirects=False)
    generator = pagegenerators.PreloadingGenerator(basicgen, 200)
    for page in generator:
        title = page.title()
        text = page.get()
        <do whatever you want with title and text>

--
André Engels, andreeng...@gmail.com
--
http://mail.python.org/mailman/listinfo/python-list