On Mar 18, 8:01 pm, Greg <gregsaundersem...@gmail.com> wrote:
> Hello all, I've been trying to find a way to fetch and read a web page
> that requires javascript on the client side and it seems impossible.

you're right: it's not impossible.

> I've read several threads in this group that say as much but I just
> can't believe it to be true

you're right: it's not true.

> (I'm subscribing to the "argument of
> personal incredulity" here).

there are several approaches you can take that combine python and
javascript. none of them are at the level of "simplicity" which you and
many others may be expecting, which is why it's believed to be
"impossible" or "not achievable".

they all have different advantages and disadvantages - don't be
surprised if you end up with 30 mb of binaries on your system, _just_ to
support the features you're implicitly asking for, ok?

here are the approaches i've found so far:

1) python-spidermonkey

python-spidermonkey "rips out" the mozilla javascript engine and
provides you with a hybrid mechanism where the execution context can be
shared between the two languages. in other words, python variables and
functions can be shoved into the namespace of the spidermonkey
javascript context and executed; python can likewise (in a rather clunky
way at the moment) gain access to the execution context and "call in".

what this approach does NOT have is the "DOM model" functions. those
are not included, as they are ONLY part of the W3C specifications that
web browsers implement, NOT the ECMAScript specification.

2) PyV8 - http://code.google.com/p/pyv8

take 1) above, and sed -e "s/python-spidermonkey/pyv8/g"

flier liu, the author of pyv8, is actually _doing_ what you want to do.
namely, he's started with a combination of python plus google's V8
javascript engine, and he's now moving on to implementing the DOM in
*python*, for execution as a python console-only application. he
recognises that executing javascript is part of the requirement, and
that's the reason why he has added google v8.
by doing this "hybrid", he will be able to "add" a global variable
called "document" to the javascript context, another global variable
called "window", etc. etc., and then "execution" of the javascript will
result in callbacks - into python - to emulate, in its entirety, the
complete W3C DOM standard.

far be it from me to tell him how monstrously large the task of
reimplementing the W3C DOM standard in python is: i urge you to consider
helping him out with his project.

3) pywebkitgtk (+patch #13) + webkit-glib/gdom (+patch #16401)

this one's a whopping-great project that takes the ENTIRE webkit engine,
patched to include glib / gobject bindings, so that python can "get at"
the DOM model, directly.

you can use this to "execute" a web page - bear in mind that GTK apps do
NOT have to be "visible" - you CAN "run" a GTK app WITHOUT actually
putting up an on-screen GUI widget. in this way, you will be able to
"load" a web page, have it be "executed", and then, after an arbitrary
amount of time of your choosing, run some python using the python
bindings to the DOM model to either "walk" the DOM model or just call
the "toString()" method and obtain a flat HTML representation of the
entire page.

CAVEATS: apple's employees are flexing their muscles and are
unfortunately showing that they have power and control by limiting the
functionality of the glib / gobject bindings to "that which they deem to
be acceptable". apple's employees have deemed that strict compliance
with the W3C standard is how they want things to be, and are ignoring
the fact that the de-facto standard is actually that specified by
Javascript implementations. in other words, toString, being a de-facto
standard, is "unacceptable" to them, as are a couple of other things.

4) python-hulahop

exactly the same as 3) except using mozilla not webkit: hulahop is the
ENTIRE gecko engine, with python bindings via the XUL interface.
the hulahop team are the ONLY people who have been able to understand
the obtuse XUL interface enough to be able to make python bindings
actually _work_ :)

it's clear that the OLPC / SUGAR team looked at webkit, initially, and
loved it. however, they saw the lack of glib/gobject bindings, and the
lack of python bindings, and freaked out (whereas i, rather stupidly,
went "nooo problem saah!" and _added_ glib / gobject bindings to webkit)
so they then went "ahhhh, safety", abandoned webkit and made a beeline
for XUL.

so they have complete and total control over the DOM model, from python,
including (thanks to gecko's ability to execute javascript using
spidermonkey) the ability to interact two-way with javascript (exactly
as can be done with webkit's glib/gobject + pywebkitgtk bindings).

so - _again_ - you have the choice of being able to run a GTK app -
without an actual "window" - load up a web page and then tell the XUL /
Gecko engine "GO! EXECUTE JAVASCRIPT!", and then, at some point in the
future, walk the DOM model using the python XUL bindings or call the
document.toString() method, from python, and obtain the resultant HTML.

so - the answer to your question is: yes, it's technically possible. and
yes, it's even been done (twice), successfully, in two separate and
distinct ways (3 and 4 above), with at least a third in active
development that i know of (pyv8), and a fourth (python-spidermonkey) as
a possible candidate for the basis of yet another alternative.

but i have to warn you - these are _not_ small projects. relying on and
leveraging e.g. Webkit means that you're backed by MAN CENTURIES of
effort (see the statistics e.g. on http://www.ohloh.net/p/WebKit : an
estimated 480 man-years of time spent so far - if you look at mozilla
you'll find it's a similar amount).

l.

--
http://mail.python.org/mailman/listinfo/python-list
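
p.s. to make the last step of approaches 3) and 4) a bit more concrete
("walk the DOM model, or flatten it back to HTML"), here's a minimal
sketch. it is NOT the pywebkitgtk or hulahop API: it uses python's
standard-library xml.dom.minidom purely as a stand-in DOM, and it does
not execute any javascript. the RENDERED string is an assumed example of
what a page might look like after its scripts have run, and collect_text
is a hypothetical helper name.

```python
from xml.dom import minidom

# assumption: markup as it might look *after* the page's scripts have run.
# a real setup would get this tree from the browser engine, not a string.
RENDERED = '<html><body><div id="out">hello <b>world</b></div></body></html>'


def collect_text(node):
    """walk the DOM depth-first and gather the data of every text node."""
    if node.nodeType == node.TEXT_NODE:
        return [node.data]
    parts = []
    for child in node.childNodes:
        parts.extend(collect_text(child))
    return parts


doc = minidom.parseString(RENDERED)

# option a: "walk" the tree and pull out what you need
text = "".join(collect_text(doc.documentElement))

# option b: serialise the whole tree back to one flat string of markup
flat_html = doc.documentElement.toxml()

print(text)
print(flat_html)
```

with the real engines the walking code has the same shape (nodes,
children, text data), you just get the document object from the
bindings instead of from minidom.parseString().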