hi john.. You're missing the issue, so a little clarification...
I've got a number of test parsers that point to a given class-list site, and the scripts work. The issue is that you never "know" whether you've gotten all of the items/links you're looking for from the XPath functions. A shortfall could be due to an error in the parsing, or it could be due to an admin changing the site (removing/adding courses, etc.). So I'm trying to figure out an approach to handling these issues.

As far as I can tell, one approach might be to run the parser script across the target site X number of times within a narrow timeframe (a few minutes). Based on the results of those runs, you might be able to build an overall "tree" of what the actual class/course link list should be (a rough sketch of this idea is appended at the bottom of this message). But you don't know from hour to hour, or day to day, whether that list is stable, since it could change at any time. The only way to know for certain is to physically examine the site, and you can't do that if you're building an automated system for 5-10 sites, let alone 500-1000.

These are the issues I'm grappling with, not how to write the XPath parsing functions.

Thanks..

-----Original Message-----
From: python-list-bounces+bedouglas=earthlink....@python.org
[mailto:python-list-bounces+bedouglas=earthlink....@python.org] On Behalf Of John Nagle
Sent: Wednesday, March 04, 2009 10:23 PM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions..

bruce wrote:
> hi phillip...
>
> thanks for taking a sec to reply...
>
> i'm solid on the test app i've created.. but as an example, i have a parser
> for usc (southern cal) and it extracts the course list/class schedule... my
> issue was that i realized that multiple runs of the app were giving different
> results... in my case, the class schedule isn't static (actually, none of
> the class/course lists need be static.. they could easily change).
>
> so i don't have a priori knowledge of what the actual class/course list site
> should look like unless i physically examine the site each time i run the
> app...
>
> i'm inclined to think i might need to run the parser a number of times
> within a given time frame, and then take a union/join of the output of the
> different runs.. in theory this would give me a high probability that i'd
> get 100% of the class list...

   I think I see the problem.

   I took a look at the USC class list, and it's been made "Web 2.0".
When you read the page, you don't get the class list; you get a
Javascript thing that builds the class list on demand, using JSON, no
less. See "http://web-app.usc.edu/soc/term_20091.html". I'm not sure
how you're handling this. The Javascript actually has to be run
before you get anything.

                                John Nagle
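
For what it's worth, here's a minimal sketch of the multi-run union idea described above. The URL and the XPath expression are placeholders, not the real ones for any particular site; the point is just to fetch the same page several times in a short window, pull out the link set on each pass, and then compare the union (seen at least once) against the intersection (seen every time) to see whether the runs agree.

# Sketch of the "run it N times and compare" idea, assuming lxml is
# installed.  SITE_URL and COURSE_LINK_XPATH are placeholders; they
# stand in for whatever page/XPath a given parser actually uses.

import time
import lxml.html

SITE_URL = "http://example.edu/schedule/classes.html"       # placeholder
COURSE_LINK_XPATH = "//a[contains(@href, 'course')]/@href"   # placeholder
RUNS = 5        # number of passes over the site
DELAY = 30      # seconds between passes, keeps it inside a few minutes

def extract_links(url, xpath):
    """One parsing pass: fetch the page and pull out the link set."""
    doc = lxml.html.parse(url)
    return set(doc.xpath(xpath))

def collect_runs(url, xpath, runs=RUNS, delay=DELAY):
    """Run the parser several times in a narrow window."""
    results = []
    for i in range(runs):
        results.append(extract_links(url, xpath))
        if i < runs - 1:
            time.sleep(delay)
    return results

def summarize(results):
    """Union = links seen at least once; intersection = seen every time.
    If they differ, either the parse is flaky or the site changed
    between passes, and the run deserves a closer look."""
    union = set().union(*results)
    stable = set.intersection(*results)
    return union, union - stable

if __name__ == "__main__":
    runs = collect_runs(SITE_URL, COURSE_LINK_XPATH)
    union, unstable = summarize(runs)
    print("links seen at least once: %d" % len(union))
    print("links that came and went: %d" % len(unstable))

The same union can be stored and compared against the next day's run; if the two snapshots differ while each day's passes agree internally, that points at an admin changing the site rather than at a broken parse.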
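
On John's point about the page being built by Javascript from JSON: one way around having to run the Javascript is to watch the browser's network traffic (Firebug or similar), find the URL the script pulls its JSON from, and fetch that URL directly. The endpoint and the key names below are purely placeholders, since I haven't dug the real ones out of the USC page; the sketch only shows the general pattern of reading the list out of JSON instead of out of rendered HTML.

# Sketch of reading the class list from the JSON feed instead of the
# rendered page.  JSON_URL is a placeholder; the real endpoint has to
# be found by watching what the page's Javascript requests.

import json
import urllib.request   # Python 3; urllib2.urlopen on Python 2

JSON_URL = "http://example.edu/soc/term_20091/depts.json"   # placeholder

def fetch_course_list(url):
    """Fetch the JSON the page's Javascript would have fetched and
    return whatever course entries it contains."""
    with urllib.request.urlopen(url) as resp:
        data = json.loads(resp.read().decode("utf-8"))
    # The structure of the JSON is an assumption here; adjust the key
    # names once the real feed has been inspected.
    return [item.get("title", "") for item in data.get("courses", [])]

if __name__ == "__main__":
    for title in fetch_course_list(JSON_URL):
        print(title)

If the feed can't be located, the fallback is a Javascript-capable fetcher (a scripted browser), which is heavier but gives you the page as a user would actually see it.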