Re: Parsing/Crawler Questions..

Philip Semanchuk Wed, 04 Mar 2009 18:15:25 -0800


On Mar 4, 2009, at 4:44 PM, bruce wrote:

Hi...

Sorry that this is a bit off track. Ok, maybe way off track!

But I don't have anyone to bounce this off of..
I'm working on a crawling project, crawling a college website, to extract course/class information. I've built a quick test app in python to crawl the site. I crawl at the top level, and work my way down to getting the required course/class schedule. The app works. I can consistently run it and extract the information. The required information is based upon an XPath analysis of
the DOM for the given pages that I'm parsing.
My issue is now that I have a "basic" app that works, I need to figure out how I guarantee that I'm correctly crawling the site. How do I know when I've got an error at a given node/branch, so that the app knows that it's
not going to fetch the underlying branch/nodes of the tree..
When running the app, I can get 5000 classes on one run, 4700 on another, etc... So I need some method of determining when I get a "complete" tree...
How do I know when I have a complete "tree"!



hi Bruce,

To put this another way, you're trying to convince yourself that your program is correct, yes? For instance, you're worried that you might be doing something like discovering a URL on a site but failing to pursue that URL, yes?

The standard way of testing any program is to test known input and look for expected output. Repeat as necessary. In your case that would mean crawling a site where you know all of the URLs and seeing whether your program finds them all. And that, of course, isn't proof of correctness, it just means that that particular site didn't trigger any error conditions that would cause your program to misbehave.
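As a rough sketch of that known-input/expected-output check: compare the set of URLs a crawl run reports against the set you know the test site contains. The function name `compare_crawl` and the URLs below are made up for illustration; they are not part of Bruce's app.

```python
def compare_crawl(expected_urls, found_urls):
    """Return (missing, unexpected) URL sets for one crawl run."""
    expected = set(expected_urls)
    found = set(found_urls)
    # missing: pages the crawler should have reached but did not
    # unexpected: pages the crawler reported that are not on the known list
    return expected - found, found - expected


# Hypothetical URLs for a test site whose full contents are known in advance.
expected = {
    "http://localhost:8000/index.html",
    "http://localhost:8000/dept/math.html",
    "http://localhost:8000/dept/math/101.html",
}
# Hypothetical result of one crawl run that skipped a page.
found = {
    "http://localhost:8000/index.html",
    "http://localhost:8000/dept/math.html",
}

missing, unexpected = compare_crawl(expected, found)
print("missing:", sorted(missing))
print("unexpected:", sorted(unexpected))
```

A non-empty `missing` set points at the exact pages a run dropped, which is more useful than comparing totals like 5000 vs. 4700.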

I think every modern OS makes it easy to run a Web server on your local machine. You might want to set up a suite of test sites on your machine and point your program at localhost. That way you can build a site to test your application in areas you fear it may be weak.
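One possible way to try the localhost idea without installing anything is Python's built-in `http.server`. The sketch below builds a tiny four-page site in a temporary directory, serves it on 127.0.0.1, and runs a toy crawler over it. The page names, the `crawl` function, and the `LinkCollector` class are all invented for this example; your own crawler would take the place of `crawl`.

```python
import functools
import http.server
import pathlib
import tempfile
import threading
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href value of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


class QuietHandler(http.server.SimpleHTTPRequestHandler):
    def log_message(self, format, *args):
        pass  # keep the test run free of per-request log lines


def crawl(start_url, base):
    """Toy crawler: follow every link that stays under `base`."""
    # An empty ProxyHandler bypasses any system proxy for localhost requests.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        with opener.open(url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        collector = LinkCollector()
        collector.feed(html)
        for href in collector.links:
            link = urljoin(url, href)
            if link.startswith(base):
                queue.append(link)
    return seen


def run_test_site():
    """Serve a small known site on localhost and return the pages crawled."""
    with tempfile.TemporaryDirectory() as root:
        site = pathlib.Path(root)
        (site / "index.html").write_text('<a href="a.html">A</a> <a href="b.html">B</a>')
        (site / "a.html").write_text('<a href="c.html">C</a>')
        (site / "b.html").write_text("<p>leaf page</p>")
        (site / "c.html").write_text('<a href="index.html">home</a>')

        handler = functools.partial(QuietHandler, directory=root)
        # Port 0 asks the OS for any free port.
        server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
        threading.Thread(target=server.serve_forever, daemon=True).start()
        try:
            base = "http://127.0.0.1:%d/" % server.server_address[1]
            found = crawl(base + "index.html", base)
        finally:
            server.shutdown()
            server.server_close()
        return {url[len(base):] for url in found}


pages = run_test_site()
print(sorted(pages))
```

Because you wrote the site yourself, you know the complete list of pages, and you can add deliberately awkward ones (broken links, redirects, malformed markup) to probe the areas you are worried about.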

I'm unclear on what you're using to parse the pages, but (X)HTML is very often invalid in the strict sense of validity. If the tools you're using expect/insist on well-formed XML or valid HTML, they'll be disappointed on most sites and you'll probably be missing URLs. The canonical solution for parsing real-world Web pages with Python is BeautifulSoup.
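To illustrate the strict-versus-lenient difference, here is a small sketch. It does not use BeautifulSoup itself; to stay dependency-free it uses the standard library's `html.parser` as a stand-in for a lenient parser and `xml.etree.ElementTree` as the strict one. The sample page and the function names are invented for the example, and I don't know which XPath toolkit your app actually uses.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Deliberately sloppy markup: unclosed <p> and <br>, unquoted attribute value.
PAGE = """<html><body>
<p>Fall schedule<br>
<a href=/courses/math101.html>Math 101</a>
<p><a href="/courses/cs201.html">CS 201</a>
</body></html>"""


class LinkCollector(HTMLParser):
    """Lenient parser: record hrefs without requiring well-formed input."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def strict_links(page):
    """Parse as XML; return None if the page is not well-formed."""
    try:
        root = ET.fromstring(page)
    except ET.ParseError:
        return None
    return [a.get("href") for a in root.iter("a")]


def lenient_links(page):
    """Parse tolerantly and return whatever links can be recovered."""
    collector = LinkCollector()
    collector.feed(page)
    return collector.links


print("strict: ", strict_links(PAGE))
print("lenient:", lenient_links(PAGE))
```

The strict parser rejects the whole page, so every link on it is lost, while the lenient one still recovers both. If some of your runs hit pages like this, that alone could account for counts that vary from run to run.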


HTH
Philip






--
http://mail.python.org/mailman/listinfo/python-list
