On Oct 12, 2:28 pm, Philip Semanchuk <[EMAIL PROTECTED]> wrote:
> On Oct 12, 2008, at 5:25 AM, S.Selvam Siva wrote:
>
> > I have to do a parsing on webpages and fetch urls. My problem is, many
> > urls i need to parse are dynamically loaded using javascript function
> > (onload()). How to fetch those links from python? Thanks in advance.
>
> Selvam,
> You can try to find them yourself using string parsing, but that's
> difficult. The closer you want to get to "perfect" at finding URLs
> expressed in JS, the closer you'll get to rewriting a JS interpreter.
> For instance, this is not so hard to understand:
> "http://example.com/"
> but this is:
> "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/,
> the_domain_variable)
>
> This is a long-standing problem for any program that parses Web pages.
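
a minimal sketch of the string-parsing approach described in the quote, to show where it breaks down. the regular expression and the sample script text are illustrative assumptions, not a complete URL grammar. the plain literal is found as-is, but the URL built with .replace() comes back with its placeholder still in it, because nothing has actually evaluated the javascript.

```python
import re

# Deliberately simple pattern: quoted strings that start with http:// or https://.
# This is an illustrative assumption, not a full URL grammar.
URL_LITERAL = re.compile(r"""["'](https?://[^"'\s]+)["']""")


def find_url_literals(script_text):
    """Return the URL-looking string literals found in a chunk of javascript."""
    return URL_LITERAL.findall(script_text)


# Hypothetical script content, modelled on the two cases quoted above.
sample = '''
var a = "http://example.com/";
var b = "http://ZZZ_DOMAIN_ZZZ/index.html".replace(/ZZZ_DOMAIN_ZZZ/, the_domain_variable);
'''

found = find_url_literals(sample)
print(found)
```

the second entry in the result is not a usable URL: the real domain only exists once the_domain_variable has been substituted at run time, which is exactly the part a regex cannot do.
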
yep :)

> You either have to embed a JS interpreter in your application or

yep. there are several.

pyv8 is the newest addition: http://advogato.org/article/985.html
it's a python wrapper around google's v8 javascript execution library.

then there's pykhtml: http://paul.giannaros.org/pykhtml/
it's a python wrapper around KHTML, providing very convenient access to KDE's HTML capabilities. what pykhtml does is "pretend" that the GUI part of KDE doesn't exist, so you can run your program from the command line. it will execute the javascript, which you will of course have to wait a bit for; then you can walk the DOM tree (using the pykhtml bindings) with pykhtml.DOM.getElementById(), getElementsByTagName("a") etc., looking for the URLs. there's even an AJAX example included which does 1-second polling of the DOM, waiting for a spell-checking web site to deliver the answer.

then there's webkit, with the new glib bindings:
https://bugs.webkit.org/show_bug.cgi?id=16401
which are then followed up by python bindings to _those_ bindings:
http://code.google.com/p/pywebkitgtk/issues/detail?id=13

this will also allow you to execute arbitrary javascript - again, it's similar to KHTML, and in fact webkit really _is_ the KDE KHTML code (JavaScriptCore, KJS etc.) but forked, improved, etc.

unfortunately, the glib bindings are tied - at three key and strategic locations - to gtk at the moment. it would take _very_ little work to "un"tie them [pay me and i'll do the work], but until that's done you would need to create a blank gtk window - just like is done with pykhtml, behind the scenes.

it would be a very simple task to create a "dummy" - console-based - port of webkit, providing an array of callbacks which you must hand to the library. at the moment, the design of webkit is not particularly good in this respect: there are three ports, gtk, wx and qt, which are heavily tied in to webkit.
it would be a _far_ better design to pass in a struct containing function callbacks (rather a lot of them - about eighty!), and then you could have a "console"-based port of webkit, which would do the job you need.

alternatively, if you don't mind wrapping a binary application with e.g. Popen3, then look at the webkit DumpRenderTree application, paying particular attention to the --html option. you won't have any control over how long the javascript is executed for: after an arbitrary and small period of time, DumpRenderTree _stops_ executing the javascript and prints out the HTML DOM (in a non-html-layout fashion - it's used for debugging and testing, but will suffice for your purposes).

so, as it stands, pywebkitgtk is _no worse_ than pykhtml, but with a little bit of tweaking, the "gtk" could be removed from "pywebkitgtk" and you'd end up with... ohh... call it "pywebkitglib"... which would be much better as a stand-alone library for your purposes.

then there's also "spidermonkey", which is mozilla's javascript engine. i haven't investigated this option: haven't had a need to.

then there's also PyXPCOMExt, which embeds python into mozilla, and from there you have PyDOM, which gives you access to the DOM of the mozilla "thing". so, if you don't mind embedding your application into XULRunner, you've got a home for executing your app and obtaining the urls, post-javascript-execution.

the neat thing about PyXPCOMExt is that you have complete and full access to python - so your app can open external TCP and UDP sockets, you can embed an entire _server_ in the damn thing if you want (you could embed... python-twisted if you wanted!), you can access the filesystem - anything. absolutely anything. reason: the _entire_ python suite is embedded into the browser. every single bit of it.

that's about all i've been able to find, so far. there might be more options out there.
not that there aren't enough already :)

all of them will give you complete and full access to execution of javascript, including AJAX execution. AJAX results arrive asynchronously, which is why you'll need to do that "polling" trick in many instances.

l.