On Mar 5, 2009, at 12:31 PM, bruce wrote:

hi..

the url i'm focusing on is irrelevant to the issue i'm trying to solve at
this time.

Not if we're to understand the situation you're trying to describe. From what I can tell, you're saying that the target site displays different results each time your crawler visits it. It's as if e.g. the site knows about 100 courses but only displays 80 randomly chosen ones to each visitor. If that's the case, then it is truly bizarre.


i think an approach will be to fire up a number of parsing attempts, and to track the returned depts/classes/etc... in theory (hopefully) i should be able to create a process to build a kind of statistical representation of what the site looks like (names of depts, names/number of classes for given depts, etc..). if i'm correct, this would provide a complete "list/understanding" of what the courselist looks like....
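
A rough sketch of that tracking step, assuming each parsing attempt has already been reduced to a list of (dept, course) pairs. The names tally_runs and split_by_stability and the sample data are made up for illustration; they are not from any existing crawler:

```python
from collections import Counter


def tally_runs(runs):
    """Count in how many runs each (dept, course) pair appeared."""
    counts = Counter()
    for run in runs:
        # set() so a pair repeated within one run is only counted once
        counts.update(set(run))
    return counts


def split_by_stability(counts, n_runs):
    """Separate pairs seen in every run from pairs seen only in some runs."""
    stable = sorted(item for item, c in counts.items() if c == n_runs)
    flaky = sorted(item for item, c in counts.items() if c < n_runs)
    return stable, flaky


# Hypothetical output of three parsing attempts against the same site.
runs = [
    [("MATH", "101"), ("MATH", "202"), ("CS", "110")],
    [("MATH", "101"), ("CS", "110")],
    [("MATH", "101"), ("MATH", "202"), ("CS", "110"), ("CS", "220")],
]

counts = tally_runs(runs)
stable, flaky = split_by_stability(counts, len(runs))
```

The union of stable and flaky is the best available guess at the full course list; the flaky pairs are the ones worth re-checking, since they could be parse misses or genuine site changes.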

i could then run the parsing process a number of times, examining the actual value/results for the query, and taking the highest/oldest values for the given query.. the idea being that the app will return correct results for most of the queries, most of the time.. so from a statistical basis.. i can take the results that are returned with the highest frequency...
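
One way the "highest frequency" selection might look, assuming the repeated results for each query have been collected into a list of hashable values (here, course counts per dept). The function name and the numbers are invented for the example:

```python
from collections import Counter


def most_frequent_result(results):
    """Return (value, share) for the most common value in a non-empty list.

    share is the fraction of runs that agreed with the winning value,
    which gives a crude confidence measure.
    """
    counts = Counter(results)
    value, hits = counts.most_common(1)[0]
    return value, hits / len(results)


# Hypothetical: number of courses found per dept over four runs.
observed = {
    "MATH": [12, 12, 11, 12],
    "CS": [8, 8, 8, 8],
}

consensus = {dept: most_frequent_result(vals) for dept, vals in observed.items()}
```

A low share for a query suggests either unstable parsing or a site that really is changing between visits, so those queries probably deserve more runs before the result is trusted.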

so this approach might work. but again, haven't seen anything in the
literature/'net that talks about this...


thoughts...

thanks



-----Original Message-----
From: python-list-bounces+bedouglas=earthlink....@python.org
[mailto:python-list-bounces+bedouglas=earthlink....@python.org]on Behalf
Of John Nagle
Sent: Thursday, March 05, 2009 8:38 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions..


bruce wrote:
hi john..

You're missing the issue, so a little clarification...

I've got a number of test parsers that point to a given classlist site.. the scripts work.

the issue that one faces is that you never "know" if you've gotten all of the items/links that you're looking for based on the XPath functions. This could be due to an error in the parsing, or it could be due to an admin changing the site (removing/adding courses etc...)

   What URLs are you looking at?

                                        John Nagle
--
http://mail.python.org/mailman/listinfo/python-list

