Hi Keith,

Of course, it doesn't necessarily matter how you get the job done if it actually works correctly. But using general tools for a general approach can lead to more correct, more robust, and more maintainable code.
Since htmlParse() in the XML package can both retrieve and parse the HTML document,

  doc = htmlParse(the.url)

is much more succinct than using curlPerform(). However, if you want to use RCurl, just use

  txt = getURLContent(the.url)

which replaces

  h = basicTextGatherer()
  curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
  h$value()

Once you have parsed the HTML document, you can find the <a> nodes that have an href attribute starting with /en/Ships via

  hrefs = unlist(getNodeSet(doc, "//a[starts-with(@href, '/en/Ships')]/@href"))

The result is a character vector, and you can extract the relevant substrings with substring(), gsub(), or any wrapper of those functions.

There are many benefits to parsing the HTML, including not falling foul of assumptions such as "as far as I can tell the the <a> tag is always on it's own line" turning out to be false.

D.

On 5/15/12 4:06 AM, Keith Weintraub wrote:
> Thanks,
> That was very helpful.
>
> I am using readLines and grep. If grep isn't powerful enough I might end up
> using the XML package but I hope that won't be necessary.
>
> Thanks again,
> KW
>
> --
>
> On May 14, 2012, at 7:18 PM, J Toll wrote:
>
>> On Mon, May 14, 2012 at 4:17 PM, Keith Weintraub <kw1...@gmail.com> wrote:
>>> Folks,
>>> I want to scrape a series of web-page sources for strings like the
>>> following:
>>>
>>> "/en/Ships/A-8605507.html"
>>> "/en/Ships/Aalborg-8122830.html"
>>>
>>> which appear in an href inside an <a> tag inside a <div> tag inside a table.
>>>
>>> In fact all I want is the (exactly) 7-digit number before ".html".
>>>
>>> The good news is that as far as I can tell the the <a> tag is always on
>>> it's own line so some kind of line-by-line grep should suffice once I
>>> figure out the following:
>>>
>>> What is the best package/command to use to get the source of a web page? I
>>> tried using something like:
>>>
>>> if(url.exists("http://www.omegahat.org/RCurl")) {
>>>   h = basicTextGatherer()
>>>   curlPerform(url = "http://www.omegahat.org/RCurl", writefunction = h$update)
>>>   # Now read the text that was cumulated during the query response.
>>>   h$value()
>>> }
>>>
>>> which works except that I get one long streamed html doc without the line
>>> breaks.
>>
>> You could use:
>>
>> h <- readLines("http://www.omegahat.org/RCurl")
>>
>> -- or --
>>
>> download.file(url = "http://www.omegahat.org/RCurl", destfile = "tmp.html")
>> h = scan("tmp.html", what = "", sep = "\n")
>>
>> and then use grep or the XML package for processing.
>>
>> HTH
>>
>> James
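P.S. Putting the pieces together, here is a minimal sketch. The URL is just a placeholder for whichever page you are actually scraping, and the regular expression encodes your "exactly 7 digits before .html" requirement:

  library(XML)

  the.url = "http://www.example.com/ships.html"   # placeholder; substitute the real page
  doc = htmlParse(the.url)

  # all href values starting with /en/Ships
  hrefs = unlist(getNodeSet(doc, "//a[starts-with(@href, '/en/Ships')]/@href"))

  # keep only hrefs of the expected shape, then pull out the 7-digit number
  pat = "^/en/Ships/.*-([0-9]{7})\\.html$"
  ids = gsub(pat, "\\1", grep(pat, hrefs, value = TRUE))

For "/en/Ships/Aalborg-8122830.html" this yields "8122830", and because the pattern anchors the seven digits between a "-" and ".html", it will not match numbers of any other length.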