[R] Scraping a web page.

Keith Weintraub Mon, 14 May 2012 14:19:12 -0700

Folks,
  I want to scrape a series of web-page sources for strings like the following:


"/en/Ships/A-8605507.html"
"/en/Ships/Aalborg-8122830.html"

which appear in an href inside an <a> tag inside a <div> tag inside a table.

In fact all I want is the (exactly) 7-digit number before ".html".
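
For what it's worth, here is a minimal sketch of the extraction step in base R, assuming each line holds at most one such href. The sample lines, the pattern, and the names html_lines, pattern and ship_ids are illustrative assumptions, not taken from the real pages.

```r
# Sample lines standing in for the page source (hypothetical markup).
html_lines <- c(
  '<div><a href="/en/Ships/A-8605507.html">A</a></div>',
  '<div><a href="/en/Ships/Aalborg-8122830.html">Aalborg</a></div>',
  '<div><a href="/en/Other/Thing-12345678.html">not a ship</a></div>'
)

# Match "/en/Ships/<name>-<7 digits>.html"; the capture group holds the digits.
# The hyphen before and the ".html" after force exactly seven digits.
pattern <- '/en/Ships/[^"]*-([0-9]{7})\\.html'

# regexec() returns the full match plus the capture group for each line,
# or an empty vector where a line does not match.
matches <- regmatches(html_lines, regexec(pattern, html_lines))

# Keep the capture group from the lines that matched.
ship_ids <- unlist(lapply(matches, function(m) if (length(m) == 2) m[2] else NULL))
print(ship_ids)
```

The IDs come back as character strings; as.integer() would convert them if numbers are needed.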

The good news is that, as far as I can tell, the <a> tag is always on its 
own line, so some kind of line-by-line grep should suffice once I figure out the 
following:

What is the best package/command to use to get the source of a web page? I 
tried using something like:
library(RCurl)
if(url.exists("http://www.omegahat.org/RCurl")) {
  h = basicTextGatherer()
  curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
   # Now read the text that was accumulated during the query response.
  h$value()
}

which works except that I get one long streamed html doc without the line 
breaks.
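
One possible way around this, sketched under the assumption that the single string returned by h$value() still contains the page's newline characters, is to split it back into lines before grepping. The page_text value below is a hypothetical stand-in for that string, so the sketch runs without a network connection.

```r
# Stand-in for the single string returned by h$value() (hypothetical content).
page_text <- "<table>\r\n<div>\n<a href=\"/en/Ships/A-8605507.html\">A</a>\n</div>\n</table>"

# Split on LF or CRLF to recover one element per source line.
page_lines <- strsplit(page_text, "\r?\n")[[1]]

# Line-by-line grep for the anchor tags of interest.
ship_lines <- grep("/en/Ships/", page_lines, fixed = TRUE, value = TRUE)
print(ship_lines)
```

Base R's readLines() on a URL may be a simpler alternative, since it returns a character vector with one element per line.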


Thanks in advance for your help,
KW



______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
