So, it sounds like your update means that the problem is related to a specific URL. I'm curious about this issue myself. I've often wondered how one could properly crawl an AJAX-ish site when you're not sure how quickly the data will be returned after the page has been loaded.
John, your advice has really helped me. Bruce / anyone else, have you had any further experience with this type of parsing / crawling?

On Mar 5, 2:50 pm, "bruce" <bedoug...@earthlink.net> wrote:
> hi john...
>
> update...
>
> further investigation has revealed that apparently, for some urls/sites, the
> server serves up pages that take awhile to be fetched... this appears to be
> a potential problem, in that it appears that the parsescript never gets
> anything from the python mech/urllib read function.
>
> the curious issue is that i can run a single test script, pointing to the
> url, and after a bit of time.. the resulting content is fetched/downloaded
> correctly. by the way, i can get the same results in my test browsing
> environment, if i start it with only a subset of the urls that i've been
> using to test the app.
>
> hmm... might be a resource issue, a timing issue,.. or something else...
> hmmm...
>
> thanks
>
> again.... the problem i'm facing really has nothing to do with a specific
> url... the app i have for the usc site works...
>
> but for any number of reasons... you might get different results when
> running the app..
> -the server could be screwed up..
> -data might be cached
> -data might be changed, and not updated..
> -actual app problems...
> -networking issues...
> -memory corruption issues...
> -process constraint issues..
> -web server overload..
> -etc...
>
> the assumption that most people appear to make is that if you create a
> parser, and run and test it once.. then if it gets you the data, it's
> working.. when you run the same app.. 100s of times, and you're slamming the
> webserver... then you realize that that's a vastly different animal than
> simply running a single query a few times...
>
> so.. nope, i'm not running the app and getting data from a dynamic page that
> hasn't finished uploading/creating the content..
>
> but what my analysis is showing, not only for the usc, but for others as
> well..
> is that there might be differences in what gets returned...
>
> which is where a smoothing algorithmic approach appears to be workable..
>
> i've been starting to test this approach, and it actually might have a
> chance of working...
>
> so.. as i've stated a number of times.. focusing on a specific url isn't the
> issue.. the larger issue is how you can
> programmatically/algorithmically/automatically, be reasonably ensured that
> what you have is exactly what's on the site...
>
> ain't screen scraping fun!!!
>
> -----Original Message-----
> From: python-list-bounces+bedouglas=earthlink....@python.org
> [mailto:python-list-bounces+bedouglas=earthlink....@python.org] On Behalf
> Of John Nagle
> Sent: Thursday, March 05, 2009 10:54 AM
> To: python-l...@python.org
> Subject: Re: Parsing/Crawler Questions - solution
>
> Philip Semanchuk wrote:
> > On Mar 5, 2009, at 12:31 PM, bruce wrote:
>
> >> hi..
>
> >> the url i'm focusing on is irrelevant to the issue i'm trying to solve at
> >> this time.
>
> > Not if we're to understand the situation you're trying to describe. From
> > what I can tell, you're saying that the target site displays different
> > results each time your crawler visits it. It's as if e.g. the site knows
> > about 100 courses but only displays 80 randomly chosen ones to each
> > visitor. If that's the case, then it is truly bizarre.
>
> Agreed. The course list isn't changing that rapidly.
>
> I suspect the original poster is doing something like reading the DOM
> of a dynamic page while the page is still updating, running a browser
> in a subprocess. Is that right?
>
> I've had to deal with that in Javascript. My AdRater browser plug-in
> (http://www.sitetruth.com/downloads) looks at Google-served ads and
> rates the advertisers. There, I have to watch for page-change events
> and update the annotations I'm adding to ads.
>
> But you don't need to work that hard here.
> The USC site is actually
> querying a server which provides the requested data in JSON format. See
>
> http://web-app.usc.edu/soc/dev/scripts/soc.js
>
> Reverse-engineer that and you'll be able to get the underlying data.
> (It's an amusing script; many little fixes to data items are performed,
> something that should have been done at the database front end.)
>
> The way to get USC class data is this:
>
> 1. Start here: "http://web-app.usc.edu/soc/term_20091.html"
> 2. Examine all the department pages under that page.
> 3. On each page, look for the value of "coursesrc", like this:
>    var coursesrc = '/ws/soc/api/classes/aest/20091'
> 4. For each "coursesrc" value found, construct a URL like this:
>    http://web-app.usc.edu/ws/soc/api/classes/aest/20091
> 5. Read that URL. This will return the department's course list in
>    JSON format.
> 6. From the JSON tree, pull out CourseData items, which look like this:
>
> "CourseData":
> {"prefix":"AEST",
> "number":"220",
> "sequence":"B",
> "suffix":{},
> "title":"Advanced Leadership Laboratory II",
> "description":"Additional exposure to the military experience for continuing
> AFROTC cadets, focusing on customs and courtesies, drill and ceremonies, and
> the environment of an Air Force officer.
> Credit\/No Credit.",
> "units":"1",
> "restriction_by_major":{},
> "restriction_by_class":{},
> "restriction_by_school":{},
> "CourseNotes":{},
> "CourseTermNotes":{},
> "prereq_text":"AEST-220A",
> "coreq_text":{},
> "SectionData":{"id":"41799",
> "session":"790",
> "dclass_code":"D",
> "title":"Advanced Leadership Laboratory II",
> "section_title":{},
> "description":{},
> "notes":{},
> "type":"Lec",
> "units":"1",
> "spaces_available":"30",
> "number_registered":"2",
> "wait_qty":"0",
> "canceled":"N",
> "blackboard":"Y",
> "comment":{},
> "day":{},"start_time":"TBA",
> "end_time":"TBA",
> "location":"OFFICE",
> "instructor":{"last_name":"Hampton","first_name":"Daniel"},
> "syllabus":{"format":{},"filesize":{}},
> "IsDistanceLearning":"N"}}},
>
> Parsing the JSON is left as an exercise for the student. (There's
> a Python module for that.)
>
> And no, the data isn't changing; you can read those pages of JSON over and
> over and get the same data every time.
>
> John Nagle
> --http://mail.python.org/mailman/listinfo/python-list
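
For anyone following along, here is a rough Python sketch of steps 3 through 6 from John's message. It is only a sketch under assumptions: the function names, the regular expression, and the recursive walk are mine, not anything taken from soc.js. The JSON layout beyond the quoted CourseData sample isn't shown in the thread, so the code searches the whole tree for "CourseData" keys instead of assuming a fixed path.

```python
import json
import re
from urllib.parse import urljoin
from urllib.request import urlopen

# Step 1 start page, as given in the quoted message.
BASE_URL = "http://web-app.usc.edu/soc/term_20091.html"

# Assumption: coursesrc is always assigned a single-quoted string, as in
#   var coursesrc = '/ws/soc/api/classes/aest/20091'
COURSESRC_RE = re.compile(r"var\s+coursesrc\s*=\s*'([^']+)'")


def find_coursesrc(page_text):
    """Step 3: return every coursesrc path found in a department page."""
    return COURSESRC_RE.findall(page_text)


def build_api_url(coursesrc, base=BASE_URL):
    """Step 4: resolve a coursesrc path against the site's base URL."""
    return urljoin(base, coursesrc)


def iter_course_data(node):
    """Step 6: yield every value stored under a "CourseData" key, at any depth."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "CourseData":
                # Assumption: the value may be one object or a list of them.
                if isinstance(value, list):
                    yield from value
                else:
                    yield value
            else:
                yield from iter_course_data(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_course_data(item)


def fetch_courses(coursesrc, timeout=30):
    """Step 5: read one department's JSON and return its CourseData items."""
    with urlopen(build_api_url(coursesrc), timeout=timeout) as response:
        tree = json.loads(response.read().decode("utf-8"))
    return list(iter_course_data(tree))
```

Something like fetch_courses('/ws/soc/api/classes/aest/20091') would then give a list of dicts shaped like the sample above. Note that empty fields show up as {} in that sample rather than as null or an empty string, so whatever consumes the result probably needs to treat an empty dict as "no value".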