When I wrote a page scraper, I mostly used string tricks to reduce the page source to a list of what I wanted. The whole script is below; its handlers return the text, images, videos, or links from a page:

global gPageURL

-- Strip script/style blocks, HTML comments, and then every remaining tag.
function getText pPageSource
   put replaceText(pPageSource,"(?:<(?P<tag>script|style)[\s\S]*?</(?P=tag)>)|(?:<!--[\s\S]*?-->)|(?:<[\s\S]*?>)","") into pPageSource
   replace lf with "" in pPageSource
   replace tab with " " in pPageSource
   return pPageSource
end getText

-- Mark each media extension with "*" plus a line break so that every
-- media URL ends a line, then keep only the marked lines.
function getMedia pPageSource
   put ".jpg,.png,.gif,.jpeg,.mov,.mp4,.m4v,.mp3" into tExtensions
   repeat with a = 1 to the number of items in tExtensions
      put item a of tExtensions into tExtension
      replace tExtension with tExtension & "*" & return in pPageSource
   end repeat
   repeat with a = the number of lines in pPageSource down to 1
      put line a of pPageSource into tLine
      if the last char of tLine is "*" then
         delete the last char of tLine
         put removeLeaders(gPageURL,tLine) into line a of pPageSource
      else
         delete line a of pPageSource
      end if
   end repeat
   return pPageSource
end getMedia

-- Keep only what follows the last quote, apostrophe, or parenthesis.
function removeLeaders pPageURL,pLinkURL
   put quote&"'()" into tDelimiters
   repeat with a = 1 to the number of chars in tDelimiters
      put char a of tDelimiters into tDelimiter
      set the itemdelimiter to tDelimiter
      put the last item of pLinkURL into pLinkURL
   end repeat
   return getPath(pPageURL,pLinkURL)
end removeLeaders

-- Put each anchor on its own line, keep the ones whose quoted value
-- starts with "http", then sort and drop duplicates.
function getLinks pPageSource
   replace "/a>" with "/a>" & return in pPageSource
   replace "<a" with return & "<a" in pPageSource
   filter pPageSource with "*a href*" & quote & "http*/a>"
   set the itemdelimiter to quote
   repeat with a = 1 to the number of lines in pPageSource
      put getPath(gPageURL,item 2 of line a of pPageSource) into line a of pPageSource
   end repeat
   sort pPageSource
   put line 1 of pPageSource into tLinks
   repeat with a = 2 to the number of lines in pPageSource
      if line a of pPageSource <> the last line of tLinks then
         put return & line a of pPageSource after tLinks
      end if
   end repeat
   return tLinks
end getLinks

-- Resolve a link against the page URL: absolute URLs pass through,
-- "/..." is relative to the host, "../..." to the parent folder.
function getPath pPageURL,pLinkURL
   if pLinkURL contains "://" then
      return pLinkURL
   end if
   set the itemdelimiter to "/"
   if char 1 of pLinkURL is "/" then
      return item 1 to 3 of pPageURL & pLinkURL
   else
      if char 1 to 3 of pLinkURL is "../" then
         delete the last item of pPageURL
         delete the last item of pPageURL
         return pPageURL & char 3 to -1 of pLinkURL
      else
         delete the last item of pPageURL
         return pPageURL & "/" & pLinkURL
      end if
   end if
   return ""
end getPath

-- Show an image, or play a video or audio file, based on the extension.
on showMedia pMediaFile
   if there is an image "mediaImage" then delete image "mediaImage"
   set the itemdelimiter to "."
   switch (the last item of pMediaFile)
      case "png"
      case "gif"
      case "jpg"
      case "jpeg"
         new image
         set the name of image the number of images to "mediaImage"
         set the filename of image "mediaImage" to pMediaFile
         break
      case "mp4"
      case "m4v"
      case "mov"
      case "mp3"
         set the showController of the templatePlayer to true
         play video pMediaFile
         break
   end switch
end showMedia
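
For what it's worth, here is a rough sketch of how the handlers might be called. It assumes the script sits in the card or stack script and that this handler lives in a button on that card. The example URL and the field names "links", "media", and "text" are made up for illustration, so substitute your own.

```livecode
-- Hypothetical button script; the URL and field names are placeholders.
on mouseUp
   global gPageURL
   put "http://www.example.com/index.html" into gPageURL
   put url gPageURL into tSource
   if the result is not empty then
      answer "Could not load the page:" && the result
      exit mouseUp
   end if
   put getLinks(tSource) into field "links"
   put getMedia(tSource) into tMedia
   put tMedia into field "media"
   put getText(tSource) into field "text"
   -- Display the first media file found, if there is one.
   if tMedia is not empty then
      showMedia line 1 of tMedia
   end if
end mouseUp
```

gPageURL has to be set before getLinks or getMedia is called, because both use it to resolve relative paths through getPath.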