Hi,

I want to mine web pages and decided to use the tm and scrapeR packages. The
example given in scrapeR's manual reads as follows:

library(scrapeR)
pageSource <- scrape(url="http://cran.r-project.org/web/packages/",
                     headers=TRUE, parse=FALSE)
if(attributes(pageSource)$headers["status"]==200) {
   page <- scrape(object="pageSource")
   xpathSApply(page, "//table//td/a", xmlValue)
} else {
   cat("There was an error with the page.\n")
}

Running it returns a list and then an error.

str(pageSource) gives

List of 1
  $ http://cran.r-project.org/web/packages/: atomic [1:1] <!DOCTYPE html
PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
.
## I have left out most of the html that was returned.
.
   ..- attr(*, "headers")= Named chr "<!DOCTYPE html PUBLIC
\"-//W3C//DTD XHTML 1.0 Strict//EN\"
\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns="|
__truncated__
   .. ..- attr(*, "names")= chr "<!DOCTYPE html PUBLIC \"-//W3C//DTD
XHTML 1.0 Strict//EN\"
\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\">\n<html xmlns="|
__truncated__

The "status" entry seems to be missing from the headers attribute of the
returned list, and scrape(object="pageSource") returns a list, which gives
xpathSApply() indigestion!
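
In case it helps with diagnosis: the xpathSApply() failure can be worked
around by indexing into the list that scrape() returns (a minimal sketch,
assuming the first element of that list is the parsed document):

page <- scrape(object="pageSource")   # returns a list of parsed documents
links <- xpathSApply(page[[1]], "//table//td/a", xmlValue)
head(links)

It is the missing "status" entry in the headers that I cannot get past.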

I am running R 2.15.3 (2013-03-01) on Ubuntu 12.04 with RCurl 1.95-4.1,
libcurl4-gnutls-dev 7.22.0-3ubuntu4.1, and libcurl3 7.22.0-3ubuntu4.1.
RCurl's basicHeaderGatherer() function returns a status of 200 for
http://cran.r-project.org/web/packages/index.html.
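
This is roughly what I ran to check that (reconstructed from memory):

library(RCurl)
h <- basicHeaderGatherer()
## fetch the page, collecting the response headers via the callback
txt <- getURI("http://cran.r-project.org/web/packages/index.html",
              headerfunction = h$update)
h$value()["status"]    # gives "200"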

I assume I have a problem with my libcurl setup. Any pointers to fixing this?

Andrew

Andrew Roberts
Oswestry UK
