Folks,
I want to scrape a series of web-page sources for strings like the following:
"/en/Ships/A-8605507.html"
"/en/Ships/Aalborg-8122830.html"
which appear in an href inside an <a> tag inside a <div> tag inside a table.
In fact all I want is the (exactly) 7-digit number before ".html".
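
To make the target concrete, here is a minimal base-R sketch of the extraction step on the two example strings. The names hrefs, pattern and ids are only illustrative, and the pattern assumes the number is always immediately preceded by a hyphen and followed by ".html".

```r
# Example href values copied from the page source
hrefs <- c("/en/Ships/A-8605507.html",
           "/en/Ships/Aalborg-8122830.html")

# Capture exactly seven digits that sit between a hyphen and ".html"
pattern <- "-([0-9]{7})\\.html"

# Pull out the matching piece of each string, then keep only the digits
matches <- regmatches(hrefs, regexpr(pattern, hrefs))
ids <- sub(pattern, "\\1", matches)
ids
```

Anchoring on the hyphen and the ".html" suffix is what enforces "exactly 7 digits": a longer or shorter run of digits in that position would not match.
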
The good news is that, as far as I can tell, the <a> tag is always on its
own line, so some kind of line-by-line grep should suffice once I figure out the
following:
What is the best package/command to use to get the source of a web page? I
tried using something like:
library(RCurl)
if (url.exists("http://www.omegahat.org/RCurl")) {
  h = basicTextGatherer()
  curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
  # Now read the text that was accumulated during the query response.
  h$value()
}
which works, except that I get one long streamed HTML document without the line
breaks.
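
If the returned text still contains newline characters and is merely held as one string, something like the following sketch might recover the lines. This is an assumption I have not confirmed against the real pages: the page string below is a made-up stand-in for the value of h$value(), and the names page, page_lines and link_lines are hypothetical.

```r
# Made-up stand-in for the single string returned by h$value()
page <- "<table>\n<div>\n<a href=\"/en/Ships/A-8605507.html\">A</a>\n</div>\n</table>"

# Split on line endings (covers both \n and \r\n)
page_lines <- strsplit(page, "\r?\n")[[1]]

# Keep only the lines that contain a ship link ending in a 7-digit number
link_lines <- grep("/en/Ships/.*-[0-9]{7}\\.html", page_lines, value = TRUE)
link_lines
```
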
Thanks in advance for your help,
KW
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help