On Mar 5, 2009, at 12:31 PM, bruce wrote:
hi..
the url i'm focusing on is irrelevant to the issue i'm trying to solve
at this time.
Not if we're to understand the situation you're trying to describe.
From what I can tell, you're saying that the target site displays
different results each time your crawler visits it. It's as if e.g.
the site knows about 100 courses but only displays 80 randomly chosen
ones to each visitor. If that's the case, then it is truly bizarre.
i think an approach will be to fire up a number of parsing attempts,
and to track the returned depts/classes/etc... in theory (hopefully) i
should be able to create a process to build a kind of statistical
representation of what the site looks like (names of depts,
names/number of classes for given depts, etc..) if i'm correct, this
would provide a complete "list/understanding" of what the courselist
looks like....
i could then run the parsing process a number of times, examining the
actual value/results for the query, and taking the highest/oldest
values for the given query.. the idea being that the app will return
correct results for most of the queries, most of the time.. so from a
statistical basis.. i can take the results that are returned with the
highest frequency...
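
A minimal sketch of the frequency idea described above, assuming each
parsing attempt yields a collection of (dept, course) pairs. The
function name aggregate_runs, the min_fraction cutoff and the sample
data are all hypothetical, made up for illustration; nothing here comes
from the actual site or scripts.

```python
from collections import Counter


def aggregate_runs(runs, min_fraction=0.5):
    """Return {item: count} for items seen in at least min_fraction of runs.

    `runs` is a list of iterables, one per parsing attempt (assumed shape).
    """
    if not runs:
        return {}
    counts = Counter()
    for run in runs:
        # Use a set so an item counts at most once per run.
        counts.update(set(run))
    threshold = min_fraction * len(runs)
    return {item: n for item, n in counts.items() if n >= threshold}


# Hypothetical results from three parsing attempts.
runs = [
    {("MATH", "101"), ("MATH", "201"), ("CS", "100")},
    {("MATH", "101"), ("CS", "100")},
    {("MATH", "101"), ("MATH", "201"), ("CS", "100"), ("CS", "999")},
]

stable = aggregate_runs(runs, min_fraction=0.5)
for item, n in sorted(stable.items()):
    print(item, "seen in", n, "of", len(runs), "runs")
```

With this sample data, ("CS", "999") appears in only one of three runs
and is dropped. The cutoff is a judgment call: a low value approaches
the union of all runs, a high value keeps only what nearly every run
agrees on.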
so this approach might work. but again, haven't seen anything in the
literature/'net that talks about this...
thoughts...
thanks
-----Original Message-----
From: python-list-bounces+bedouglas=earthlink....@python.org
[mailto:python-list-bounces+bedouglas=earthlink....@python.org] On
Behalf Of John Nagle
Sent: Thursday, March 05, 2009 8:38 AM
To: python-list@python.org
Subject: Re: Parsing/Crawler Questions..
bruce wrote:
hi john..
You're missing the issue, so a little clarification...
I've got a number of test parsers that point to a given classlist
site.. the scripts work.
the issue that one faces is that you never "know" if you've gotten all
of the items/links that you're looking for based on the XPath
functions. This could be due to an error in the parsing, or it could be
due to an admin changing the site (removing/adding courses etc...)
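
One way to at least notice that something changed, sketched under the
assumption that an earlier run's results were saved as a baseline. The
function name compare_to_baseline and the sample items are hypothetical.
Note that this only reports the difference; by itself it cannot tell a
parsing error from a real change made by the site admin.

```python
def compare_to_baseline(baseline, current):
    """Report items that disappeared from, or were added to, a baseline."""
    baseline, current = set(baseline), set(current)
    return {
        "missing": baseline - current,  # in the baseline, absent now
        "new": current - baseline,      # present now, not in the baseline
    }


# Hypothetical saved results and a fresh parse.
baseline = {("MATH", "101"), ("MATH", "201"), ("CS", "100")}
current = {("MATH", "101"), ("CS", "100"), ("CS", "150")}

diff = compare_to_baseline(baseline, current)
print("missing:", sorted(diff["missing"]))
print("new:", sorted(diff["new"]))
```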
What URLs are you looking at?
John Nagle
--
http://mail.python.org/mailman/listinfo/python-list