When I wrote something to do page scraping, I used mostly string tricks to
reduce the source to a list of what I wanted. Below is the getText handler
from that script; the other handlers let me get the images, videos, or links
from a page in the same fashion:
global gPageURL
function getText pPageSource
   put pPageSource into tSource -- work on a copy of the page source
   replace "<" with return & "<" in tSource -- break each tag onto its own line
   replace ">" with ">" & return in tSource
   filter tSource without "<*" -- drop the tag lines, keeping only the text
   return tSource
end getText
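In use it is just a matter of fetching the page and handing the source to the
handler, something like the sketch below (the button handler and the field
name are only illustrations, not part of the original script):

on mouseUp
   put URL gPageURL into tPageSource -- fetch the raw HTML for the page
   put getText(tPageSource) into field "Scraped Text" -- show the result
end mouseUp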
I imagine the problem isn't storing it, but rather dealing with all the
possible exceptions to proper HTML.
Perhaps you could pass it through some kind of HTML fixer. I found a
half-dozen websites and an open-source project with one Google search:
http://tidy.sourceforge.net/
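If the tidy command-line tool is installed, you could even call it straight
from a script with shell() before doing any parsing. A rough sketch, where the
handler name and temp-file name are just placeholders:

function tidyPage pPageSource
   put specialFolderPath("temporary") & "/rawpage.html" into tFile
   put pPageSource into URL ("file:" & tFile) -- write the raw source to a temp file
   -- tidy prints the cleaned-up markup to stdout; -q keeps its chatter down
   return shell("tidy -q -asxhtml" && quote & tFile & quote)
end tidyPage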