Folks, I want to scrape a series of web-page sources for strings like the following:
    "/en/Ships/A-8605507.html"
    "/en/Ships/Aalborg-8122830.html"

which appear in an href inside an <a> tag inside a <div> tag inside a table. In fact, all I want is the (exactly) 7-digit number before ".html".

The good news is that, as far as I can tell, the <a> tag is always on its own line, so some kind of line-by-line grep should suffice once I figure out the following: what is the best package/command to use to get the source of a web page?

I tried using something like:

    library(RCurl)

    if (url.exists("http://www.omegahat.org/RCurl")) {
      h = basicTextGatherer()
      curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
      # Now read the text that was cumulated during the query response.
      h$value()
    }

which works, except that I get one long streamed HTML document without the line breaks.

Thanks in advance for your help,

KW

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
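
For reference, here is a minimal sketch of the line-by-line approach described in the question. It is only one possible way to do it, not a definitive answer. It assumes base R's readLines() can fetch the page (it accepts a URL string, though https support depends on the R build) and that every href of interest looks like the two examples quoted. The names page_lines, extract_ship_ids and sample_lines are made up for illustration, and the sample lines stand in for a real page.

```r
# Hypothetical helper: return the page source as one string per line.
# Not called here, because the real site's URL is not given in the post.
page_lines <- function(u) readLines(u, warn = FALSE)

# Hypothetical helper: pull the 7-digit number that sits between the last
# hyphen and ".html" in hrefs of the form /en/Ships/<name>-<7 digits>.html
extract_ship_ids <- function(lines) {
  pattern <- "/en/Ships/[^\"/]*-([0-9]{7})\\.html"
  # gregexpr finds every match on every line; regmatches extracts them
  hits <- unlist(regmatches(lines, gregexpr(pattern, lines)))
  # keep only the captured digits from each full match
  sub(pattern, "\\1", hits)
}

# Stand-in for real page source, built from the two hrefs quoted above
sample_lines <- c(
  "<table><tr><td><div>",
  "<a href=\"/en/Ships/A-8605507.html\">A</a>",
  "<a href=\"/en/Ships/Aalborg-8122830.html\">Aalborg</a>",
  "</div></td></tr></table>"
)

ids <- extract_ship_ids(sample_lines)
print(ids)
```

If the page has already been fetched as one long string (as with the basicTextGatherer approach), the same function can be applied to it directly, since the pattern does not depend on line breaks; alternatively, strsplit(txt, "\n")[[1]] would split such a string back into lines, assuming the line breaks are still present in it.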