On Aug 16, 12:02 am, Mike Paul <paul.mik...@gmail.com> wrote:
> I'm trying to scrape a dynamic page with a lot of javascript in it.
> In order to get all the data from the page I need to access the
> javascript. But I've no idea how to do it.
>
> Say I'm scraping some site http://www.xyz.com/xyz
>
> import urllib2
> request = urllib2.Request("http://www.xyz.com/xyz")
> response = urllib2.urlopen(request)
> data = response.read()
>
> So I get all the data on the initial page. Now I need to access
> the javascript on this page to get additional details. I've heard someone
> telling me to use spidermonkey. But I have no idea how to send javascript
> as a request and get the response. How should I be sending the javascript
> request? How can it be sent?

you need to actually _execute_ the web page under a browser engine.
you will not be able to do what you want using urllib alone: it only
fetches the page source and never runs the scripts in it.

> Can anyone tell me how can i do it very clearly. I've been breaking my
> head over this for the past few days with no progress.

there are about four or five engines that you can use, depending on
the target platform. see:
http://wiki.python.org/moin/WebBrowserProgramming

1) python-khtml (pykhtml).
2) pywebkitgtk (with DOM / glib-gobject bindings patches)
3) python-hulahop and xulrunner
4) Trident (the MSHTML engine behind IE) accessed through python comtypes
5) macosx objective-c bindings and use pyobjc from there.

options 2-4 i have successfully used and proven that it can be done:
http://pyjamas.svn.sourceforge.net/viewvc/pyjamas/trunk/pyjd/

option 1) i haven't done due to an obscure bug in the KDE KHTML
python-c++ bindings; option 5) i haven't done because there's no point:
XMLHttpRequest has been deliberately excluded due to short-sightedness
of the webkit developers, which has only recently been corrected (but
the work still needs to be done).

so, using a web browser engine, you must load and execute the page,
wait until a certain amount of time has elapsed and the javascript has
completed execution, and then use DOM manipulation to extract the web
page text.

if you _really_ want to create your own javascript execution engine,
which, my god, will be a hell of a lot of work but would be extremely
beneficial, you would do well to help flier liu with pyv8, and paul
bonser with pybrowser.

flier is doing a web-site-scraping system, browsing millions of pages
and executing the javascript under pyv8. paul is implementing a web
browser in pure python (using python-cairo as the graphics engine).
he's got part-way through the project, having focussed initially on a
W3C standards-compliant implementation of the DOM, and less on the
graphics side.
that means that what paul has will be somewhat more suited to what you
need, because you don't want a graphics engine at all.

if paul's work isn't suitable, then you will simply have to run one of
the above engines _without_ firing up the actual GUI window. in the
GTK-based engines, you just... don't call show() or show_all(); in the
MSHTML-based one, i presume you just don't fire a WM_SHOW event at it.
you'll work it out.

l.
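
A rough illustration of the first point above (a plain fetch only gives you the page source, and nothing in it gets executed). This is a sketch, not anything from the thread: it uses Python 3's standard html.parser rather than urllib2, and the inline PAGE string, the TextAndScripts class and the split_page helper are made-up stand-ins for a real site and a real scraper. No browser engine is involved, which is exactly why the value set by the script never shows up in the extracted text.

```python
from html.parser import HTMLParser

# Hypothetical page: the <div> is empty in the served HTML and would only
# be filled in if a browser engine actually ran the <script>.
PAGE = """<html><body>
<h1>prices</h1>
<div id="out"></div>
<script>
document.getElementById("out").textContent = "42 EUR";
</script>
</body></html>"""


class TextAndScripts(HTMLParser):
    """Collect visible text and raw script source separately."""

    def __init__(self):
        super().__init__()
        self.in_script = False
        self.text = []     # text a static scraper would see
        self.scripts = []  # javascript source, never executed here

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self.in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self.in_script = False

    def handle_data(self, data):
        if self.in_script:
            self.scripts.append(data)
        elif data.strip():
            self.text.append(data.strip())


def split_page(html):
    """Return (visible text chunks, script source chunks) for an HTML string."""
    parser = TextAndScripts()
    parser.feed(html)
    parser.close()
    return parser.text, parser.scripts


text, scripts = split_page(PAGE)
print(text)                 # what you get without executing anything
print("".join(scripts))     # the javascript is present only as source text
```

The "42 EUR" string exists only inside the script source; to see it in the DOM, the page has to be loaded and run by one of the engines listed above.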