Re: [Pharo-users] Soup bug(fix)

Alistair Grant Sat, 11 Nov 2017 07:40:33 -0800

On 9 November 2017 at 00:00, Kjell Godo <squeakl...@gmail.com> wrote:
> i like to collect some newspaper comics from an online newspaper
>      but it takes really long to do it by hand by hand
> i tried Soup but i didn’t get anywhere
>      the pictures were hidden behind a script or something
> is there anything to do about that?


Most of the web pages I want to scrape use JavaScript to construct the
DOM, which makes Soup, XMLHTMLParser, etc. useless, since they only see
the HTML as it was downloaded.
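
To illustrate the problem, here is a minimal sketch in Python rather than
Pharo, using only the standard library's html.parser. The page content,
the image path and the names ImgCollector and static_img_sources are all
made up for the example. The <img> is only created when the script runs,
so a parser that just reads the downloaded HTML never finds it:

```python
from html.parser import HTMLParser

# Hypothetical page: the <img> element only exists after a browser
# has executed the script and modified the DOM.
PAGE = """
<html><body>
<div id="comic"></div>
<script>
  var img = document.createElement("img");
  img.src = "/comics/today.png";
  document.getElementById("comic").appendChild(img);
</script>
</body></html>
"""


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag in the static HTML."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            self.sources.append(dict(attrs).get("src"))


def static_img_sources(html):
    """Return the img sources visible without running any script."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# The script body is treated as opaque text, so no <img> is found.
print(static_img_sources(PAGE))  # prints []
```

The same function does find images that are present in the markup itself,
e.g. static_img_sources('<img src="a.png">') returns ['a.png'].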

I've extended Torsten's Pharo-Chrome library and use that to navigate
the DOM in a way similar to Soup:

https://github.com/akgrant43/Pharo-Chrome

This gets around the issue with JavaScript since it waits for the
browser to load the page, run the JavaScript and construct the DOM.

HTH,
Alistair



>         i don’t want to collect them all
> i have the XPath .pdf but i haven’t read it yet
>
> these browsers seem to gobble up memory
>      and while open they just keep getting bigger till the OS session crash
>      might there be a browser that is more minimal?
>
> Vivaldi seems better at not bloating up RAM
