On 9 November 2017 at 00:00, Kjell Godo <squeakl...@gmail.com> wrote:
> i like to collect some newspaper comics from an online newspaper
> but it takes really long to do it by hand by hand
> i tried Soup but i didn’t get anywhere
> the pictures were hidden behind a script or something
> is there anything to do about that?

Most of the web pages I want to scrape use JavaScript to construct the DOM, which makes Soup, XMLHTMLParser, etc. useless.

I've extended Torsten's Pharo-Chrome library and use that to navigate the DOM in a way similar to Soup:

https://github.com/akgrant43/Pharo-Chrome

This gets around the issue with JavaScript, since it waits for the browser to load the page, run the JavaScript and construct the DOM.

HTH,
Alistair

> i don’t want to collect them all
> i have the XPath .pdf but i haven’t read it yet
>
> these browsers seem to gobble up memory
> and while open they just keep getting bigger till the OS session crash
> might there be a browser that is more minimal?
>
> Vivaldi seems better at not bloating up RAM
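
To illustrate why a static parser comes up empty on a JavaScript-built page, here is a minimal sketch in Python using only the standard library. It does not use Pharo-Chrome or Soup; the page markup, the `comic` id and the `/comics/today.png` path are all made up for the example. The same image collector is run over the HTML as the server might send it and over the DOM as it might look once a browser has run the script.

```python
from html.parser import HTMLParser


class ImgCollector(HTMLParser):
    """Collects the src attribute of every <img> tag found in the markup."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def image_sources(html):
    """Return the list of image URLs a static parser can see in `html`."""
    parser = ImgCollector()
    parser.feed(html)
    parser.close()
    return parser.sources


# Hypothetical page as delivered by the server: the <img> does not exist yet,
# it is only created when the script runs in a browser.
RAW_PAGE = """<html><body><div id="comic"></div>
<script>
  var img = document.createElement("img");
  img.src = "/comics/today.png";
  document.getElementById("comic").appendChild(img);
</script></body></html>"""

# Hypothetical serialization of the same page after a browser ran the script.
RENDERED_PAGE = """<html><body><div id="comic"><img src="/comics/today.png"></div></body></html>"""

if __name__ == "__main__":
    print("static parse of raw page:     ", image_sources(RAW_PAGE))
    print("static parse of rendered page:", image_sources(RENDERED_PAGE))
```

The raw page yields no image URLs because the parser never executes the script. Driving a real browser and reading the DOM it has built gives the parser the second form to work with.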