Kjell,

 

Almost certainly the HTML files will not contain the actual picture data; they 
will just contain an ‘img’ element whose ‘src’ attribute gives the address to 
load the picture file from. If the web pages are built to a regular pattern, you 
should be able to parse them and locate the image nodes you want.
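
Something like this sketch might get you started. It assumes the comics are 
plain ‘img’ elements whose ‘src’ values are absolute URLs, and the page address 
is of course a placeholder; XMLHTMLParser is from the XMLParserHTML package, 
and ZnEasy/ZnUrl are part of Zinc in the standard image:

| doc urls |
doc := XMLHTMLParser parseURL: 'http://example.com/comics'. "placeholder address"
urls := (doc allElementsNamed: 'img')
	collect: [ :each | each attributeAt: 'src' ].
"Fetch each picture and save it under its own file name. A relative src
would first need resolving against the page URL, e.g. with ZnUrl>>inContextOf:."
urls do: [ :url |
	| response |
	response := ZnEasy get: url.
	url asZnUrl pathSegments last asFileReference
		binaryWriteStreamDo: [ :stream |
			stream nextPutAll: response contents ] ]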

 

I haven’t had any problem with XMLHTMLParser’s parse taking up too much memory. 
My machine has 4GB of RAM; if you have much less than that, you might have 
trouble. If you have found a systematic way to locate the picture file, you 
could minimise the size of the DOM the parser creates by using a streaming 
parser. The streaming version of Monty’s parser is called StAXHTMLParser.

 

I have a bit of experience playing with these parsers. If you get stuck, ask 
again here with more details; I may be able to help.

 

Peter Kenny

 

From: Pharo-users [mailto:pharo-users-boun...@lists.pharo.org] On Behalf Of 
Kjell Godo
Sent: 08 November 2017 23:00
To: Any question about pharo is welcome <pharo-users@lists.pharo.org>
Subject: Re: [Pharo-users] Soup bug(fix)

 

I’d like to collect some newspaper comics from an online newspaper

     but it takes really long to do it by hand

I tried Soup but I didn’t get anywhere

     the pictures were hidden behind a script or something

Is there anything to do about that? I don’t want to collect them all

I have the XPath .pdf but I haven’t read it yet

 

These browsers seem to gobble up memory

     and while open they just keep getting bigger until the OS session crashes

     might there be a more minimal browser?

 

Vivaldi seems better at not bloating up RAM
