As you say, Python is not the solution in the short term... Plus, I don't really want to use it. I suppose it may make some tasks easier, if you have the right library... But it adds another interpreted language to the mix, and I would rather avoid that.
Anyway, I am not using a WebView at all: I create a WebPage, then call setHtml on the file I downloaded, then walk the DOM for the nodes I need. I tried the XML parser first, since the site advertises XHTML... but the XML is really broken. If I have to rewrite the "data miner", I will simply go over the HTML and match the right regexps.

It looks like the tidy library could do what I need; it is a dependency as well, but it should be a lighter one than Python (and QtWebKit, but that was very convenient...).

Also, I have one data source for now, but I expect to have at least one more in the future, possibly more if I find data sources in other countries, so I will need different data extractors.

In any case... The library will still be there, right? This will only prevent my application from being allowed in the Harbour store? I could live with that; my application could live in a third-party repository.

Luciano

On Tue, Nov 26, 2013 at 7:05 AM, Thomas Perl <th.p...@gmail.com> wrote:
> Hi,
>
> 2013/11/26 Luciano Montanaro <mikel...@gmail.com>:
>> On Nov 26, 2013 2:07 AM, "Robin Burchell" <robin.burch...@jolla.com> wrote:
>> [...]
>> My application too depends on it to scrape data from a web page. I need the
>> QWebElement interface, otherwise I will need to parse the html on my own.
>> [...]
>> Well, access to the DOM model...
>
> Depending on how JavaScript-laden the page you are trying to scrape
> is, something like BeautifulSoup or Mechanize (both written in Python;
> the latter one might sound familiar to Perl programmers, it’s designed
> after WWW::Mechanize) might do the job, and in a more lightweight way
> (no need to download images or execute JS / layout the page for simple
> scraping):
>
> http://www.crummy.com/software/BeautifulSoup/
> http://wwwsearch.sourceforge.net/mechanize/
>
> Of course, this drags in a new dependency that also isn’t supported at
> the moment (Python), but as mentioned in the announcement[1], "we are
> actively working on getting Python support into shape", and once that
> will be supported (PyOtherSide QML Plugin), it might be easier to
> integrate and more efficient than moving the whole webpage through a
> WebView and going through that with the DOM.
>
> And if your page is JavaScript-laden, and you can’t parse the static
> HTML using BeautifulSoup or Mechanize, chances are the data parsed by
> JavaScript is also available as JSON somewhere (just look into the
> webpage code / watch the traffic) - and that’ll definitely be easier
> to parse, too :)
>
> HTH :)
> Thomas
>
> [1] https://lists.sailfishos.org/pipermail/devel/2013-November/001319.html
> _______________________________________________
> SailfishOS.org Devel mailing list

--
Luciano Montanaro

Anyone who is capable of getting themselves made President should on no
account be allowed to do the job. -- Douglas Adams