Re: html->array

2015-11-14 Thread Colin Holgate
When I wrote something to do page scraping, I mostly used string tricks to reduce the source to a list of what I wanted. Below is the whole script; the different handlers let me get the text, images, videos, or links from a page:

global gPageURL

function getText pPageSource
   put replace
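
Since the script in this message is cut off, here is a rough sketch of the same "string tricks" idea. It is written in Python rather than LiveCode and is not Colin's code: the function names get_links, get_images and get_text are made up for illustration, and the patterns assume reasonably ordinary markup with quoted attribute values.

```python
import html
import re


def _flatten(page_source):
    """Collapse runs of whitespace so tags split across lines match more easily."""
    return " ".join(page_source.split())


def get_links(page_source):
    """Return the href values of <a> tags, in document order."""
    pattern = r'<a\b[^>]*?\bhref\s*=\s*["\']([^"\']*)["\']'
    return re.findall(pattern, _flatten(page_source), flags=re.IGNORECASE)


def get_images(page_source):
    """Return the src values of <img> tags, in document order."""
    pattern = r'<img\b[^>]*?\bsrc\s*=\s*["\']([^"\']*)["\']'
    return re.findall(pattern, _flatten(page_source), flags=re.IGNORECASE)


def get_text(page_source):
    """Return the visible text with scripts, styles and tags removed."""
    # Drop <script> and <style> blocks first, since their contents are not page text.
    no_scripts = re.sub(r"(?is)<(script|style)\b.*?</\1\s*>", " ", page_source)
    # Replace every remaining tag with a space, then decode entities like &amp;.
    no_tags = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    return " ".join(html.unescape(no_tags).split())
```

This kind of pattern matching is quick to write but will miss unquoted attributes and can be fooled by markup inside comments, so it suits pages whose layout you already know.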

Re: html->array

2015-11-14 Thread Matt Maier
I imagine the problem isn't storing it, but rather dealing with all the possible exceptions to proper HTML. Perhaps you could pass it through some kind of HTML fixer. I found a half dozen websites and an open source project with one Google search: http://tidy.sourceforge.net/
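
To illustrate why malformed markup is the hard part, here is a small sketch of turning HTML into a nested array while tolerating unclosed and stray tags. It does not use HTML Tidy; it swaps in Python's standard html.parser, which reports tags without raising on sloppy input. The ArrayBuilder class, the html_to_array function and the node layout (tag, attrs, children) are assumptions made for this example, and the recovery rules are much cruder than a real fixer's.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ArrayBuilder(HTMLParser):
    """Build a nested dict/list structure from (possibly malformed) HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = {"tag": "root", "attrs": {}, "children": []}
        self.stack = [self.root]  # chain of currently open elements

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "attrs": dict(attrs), "children": []}
        self.stack[-1]["children"].append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close the nearest open element with this name, along with anything
        # left open inside it. A closing tag with no match is ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i]["tag"] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.stack[-1]["children"].append(text)


def html_to_array(source):
    """Parse source and return the root node of the nested structure."""
    builder = ArrayBuilder()
    builder.feed(source)
    builder.close()
    return builder.root
```

For example, html_to_array('<div class="a"><p>Hi <b>there</p></div>') still yields a div containing a p, even though the b element is never closed. A dedicated fixer such as Tidy applies far more of the HTML recovery rules than this does.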