As you say, Python is not the solution, short term...
Plus I don't really want to use it. I suppose it may make some tasks
easier, if you have the right library... But it adds another
interpreted language to the mix, and I would rather avoid it.

Anyway, I am not using a WebView at all: I create a WebPage, then use
setHtml on the file I downloaded, then walk the DOM for the nodes I
need.

I tried with the XML parser first, since the site advertises XHTML...
but the XML is really broken.

If I have to rewrite the "data miner", I will simply go over the HTML
and match the right regexps.
It looks like the tidy library could do what I need; it is a
dependency as well, but it should be a lighter one than Python (and
QtWebKit, but that was very convenient...).
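
For reference, the regexp approach could look roughly like the sketch
below, using only std::regex from the C++ standard library. The markup
it matches (table cells with class "name" and "time"), the Entry struct
and the extractEntries function are all made up for illustration; the
real pattern depends on what the page actually contains, and a regexp
like this breaks as soon as the markup changes shape.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical record type: one name and one time per table row.
struct Entry {
    std::string name;
    std::string time;
};

// Scan raw HTML for pairs of cells shaped like
//   <td class="name">...</td> <td class="time">...</td>
// The class names are assumptions made for this example only.
std::vector<Entry> extractEntries(const std::string &html)
{
    // Capture the cell text without surrounding whitespace.
    // [^<] keeps each capture inside a single cell.
    static const std::regex cellPair(
        R"rx(<td class="name">\s*([^<]*?)\s*</td>\s*<td class="time">\s*([^<]*?)\s*</td>)rx");

    std::vector<Entry> entries;
    for (std::sregex_iterator it(html.begin(), html.end(), cellPair), end;
         it != end; ++it) {
        entries.push_back({(*it)[1].str(), (*it)[2].str()});
    }
    return entries;
}
```

Each additional data source would then need its own pattern, or its own
small extractor function with the same return type.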

Also, I have one data source for now, but I expect to have at least
one more in the future, possibly more if I find data sources in other
countries, so I will need different data extractors.

In any case... The library will still be there, right? This will only
prevent my application from being allowed in the Harbour store? I could
live with that; my application could live in a third party repository.


On Tue, Nov 26, 2013 at 7:05 AM, Thomas Perl <> wrote:
> Hi,
> 2013/11/26 Luciano Montanaro <>:
>> On Nov 26, 2013 2:07 AM, "Robin Burchell" <> wrote:
>> [...]
>> My application too depends on it to scrape data from a web page. I need the
>> QWebElement interface, otherwise I will need to parse the html on my own.
>> [...]
>> Well, access to the DOM model...
> Depending on how JavaScript-laden the page you are trying to scrape
> is, something like BeautifulSoup or Mechanize (both written in Python;
> the latter one might sound familiar to Perl programmers, it’s designed
> after WWW::Mechanize) might do the job, and in a more lightweight way
> (no need to download images or execute JS / layout the page for simple
> scraping):
> Of course, this drags in a new dependency that also isn’t supported at
> the moment (Python), but as mentioned in the announcement[1], "we are
> actively working on getting Python support into shape", and once that
> is supported (PyOtherSide QML Plugin), it might be easier to
> integrate and more efficient than moving the whole webpage through a
> WebView and going through that with the DOM.
> And if your page is JavaScript-laden, and you can’t parse the static
> HTML using BeautifulSoup or Mechanize, chances are the data parsed by
> JavaScript is also available as JSON somewhere (just look into the
> webpage code / watch the traffic) - and that’ll definitely be easier
> to parse, too :)
> HTH :)
> Thomas
> [1]
> _______________________________________________
> Devel mailing list

Luciano Montanaro

Anyone who is capable of getting themselves made President should on
no account be allowed to do the job. -- Douglas Adams