K wrote:
Hello everyone,
I understand that urllib and urllib2 serve as really simple page
request libraries. I was wondering if there is a library out there
that can list the HTTP requests that loading a given page triggers.
Example:
URL: http://www.google.com/test.html
Something like: urllib.urlopen('http://www.google.com/test.html').files()
Lists HTTP Requests attached to that URL:
=> http://www.google.com/test.html
=> http://www.google.com/css/google.css
=> http://www.google.com/js/js.css
There are no "Requests attached" to an url. There is a HTML-document
behind it, that might contain further external references.
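In other words, fetching the URL just gives you the raw markup; it's
the browser that parses it and then issues the follow-up requests. A
two-line illustration (assuming Python 2's urllib, as in your example):

import urllib
html = urllib.urlopen('http://www.google.com/test.html').read()
# html is nothing but text at this point -- the extra requests for
# CSS, JS and images only happen once something parses the markup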
The other fun part is the inclusion of JS within <script> tags, e.g.
the new Google Analytics script
=> http://www.google-analytics.com/ga.js
or CSS @imports
=> http://www.google.com/css/import.css
I would like to keep track of those too, but I realize that Python
does not have a JS engine. :( Anyone with ideas on how to track these
items, or am I out of luck?
You can use e.g. BeautifulSoup to extract all the external references
from the page. What you can't do, though, is get the requests that are
issued by Javascript that is actually *running*.
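Something along these lines would do it -- an untested sketch for
Python 2 with BeautifulSoup installed; the tag/attribute selection and
the @import regex below only cover the common cases, not a full CSS
parse:

import re
import urllib2
import urlparse
from BeautifulSoup import BeautifulSoup

def external_references(url):
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    refs = []
    # <script src=...> and <img src=...>
    for tag in soup.findAll(['script', 'img'], src=True):
        refs.append(urlparse.urljoin(url, tag['src']))
    # <link href=...> covers stylesheets, favicons etc.
    for tag in soup.findAll('link', href=True):
        refs.append(urlparse.urljoin(url, tag['href']))
    # crude follow-up for @imports: fetch each stylesheet and grep it
    import_re = re.compile(r'@import\s+(?:url\()?["\']?([^"\')\s;]+)')
    for ref in list(refs):
        if ref.endswith('.css'):
            css = urllib2.urlopen(ref).read()
            for imported in import_re.findall(css):
                refs.append(urlparse.urljoin(ref, imported))
    return refs

for ref in external_references('http://www.google.com/test.html'):
    print ref

That still leaves whatever a running script requests (like the
dynamically written ga.js include); for those you'd need an actual
browser engine driving the page.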
Diez