Thanks, the second approach worked fine on Windows. --JJS
On Thu, August 15, 2013 8:38 am, Jeffrey Dick wrote:
> Sorry, I can't generate an error when running those commands in R on Linux
> 64-bit. But if I move to Windows (R version 3.0.1, XML_3.98-1.1), I get a
> different error ...
>
>> require(XML)
> Loading required package: XML
>> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
>> node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
> Input is not proper UTF-8, indicate encoding !
> Bytes: 0xC2 0x0A 0x20 0x20
> Error: 1: Input is not proper UTF-8, indicate encoding !
> Bytes: 0xC2 0x0A 0x20 0x20
>> node <- getNodeSet(doc, "//link[@rel='alternate']" )
> Error in UseMethod("xpathApply") :
>   no applicable method for 'xpathApply' applied to an object of class
>   "XMLDocumentContent"
>
> ... note that I've tried both doc[[1]] and doc in the function call. Also,
> only the XML library is required. I'm not sure what's going on with the
> character encoding error, might be my system settings. Reading the help
> page (?htmlTreeParse) provides a clue to use the htmlParse function
> instead, equivalent to setting the useInternalNodes parameter to TRUE ...
> "These can then be searched using XPath expressions via 'xpathApply' and
> 'getNodeSet'." That seems to be relevant to this case.
>
>> doc <- htmlParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
>> node <- xpathSApply(doc, "//link[@rel='alternate']", xmlAttrs)
>> node
>       [,1]
> rel   "alternate"
> type  "application/atom+xml"
> title "ATOM"
> href  "/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=&dateb=&owner=exclude&count=40&output=atom"
>> strsplit(strsplit(node[[4]], "CIK=")[[1]][2], "&type")[[1]][1]
> [1] "0000789019"
>
> Perhaps that approach is less prone to error.
>
>
> On Thu, Aug 15, 2013 at 12:48 PM, Sparks, John James
> <jspa...@uic.edu> wrote:
>
>> Thanks so much for looking into this for me.
>>
>> Unfortunately, I get an error when I execute your code. Is there a
>> library that you loaded that I haven't?
>>
>> require(scrapeR)
>> require(XML)
>> require(RCurl)
>> doc<-htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
>> node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
>> Error in UseMethod("xpathApply") :
>>   no applicable method for 'xpathApply' applied to an object of class
>>   "character"
>>
>>
>> Guidance would be much appreciated.
>>
>> --JJS
>>
>>
>>
>> On Wed, August 14, 2013 4:19 am, Jeffrey Dick wrote:
>> > Hi,
>> >
>> > There are many occurrences of the CIK number in the page source. This
>> > pulls out the first node containing it:
>> >
>> > node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
>> >
>> > From there you can extract the number. Here's one way to do it.
>> >
>> > strsplit(strsplit(unlist(node)[[5]], "CIK=")[[1]][2], "&type")[[1]][1]
>> >
>> > Jeff
>> >
>> >
>> > On Wed, Aug 14, 2013 at 1:34 PM, Sparks, John James <jspa...@uic.edu>
>> > wrote:
>> >
>> >> Dear R Helpers,
>> >>
>> >> I would like to pull the CIK number from the web page
>> >>
>> >> http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany
>> >>
>> >> If you put this web page into your browser you will see the CIK number
>> >> in red on the left side of the page near the top.
>> >>
>> >> When I try the basic
>> >> require(scrapeR)
>> >> require(XML)
>> >> require(RCurl)
>> >> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
>> >> str(doc)
>> >>
>> >> I get a large number of items in the data frame that I don't know how to
>> >> interpret. Both
>> >> tables <- readHTMLTable(doc)
>> >>
>> >> and
>> >>
>> >> list<-xmlToList(doc)
>> >>
>> >> result in errors.
>> >>
>> >> Any (positive) guidance would be much appreciated.
>> >>
>> >> --John J. Sparks, Ph.D.
>> >>
>> >> ______________________________________________
>> >> R-help@r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
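
A possible alternative to the nested strsplit() call quoted in the thread: once the href attribute is in hand, the CIK can be pulled out with a single regular expression. This is only a sketch in base R. The helper name extract_cik is made up for illustration, the href string is copied from the output quoted in the thread, and it assumes the CIK is the run of digits that directly follows "CIK=".

```r
# href value as printed in the thread (the 4th element of 'node')
href <- "/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=&dateb=&owner=exclude&count=40&output=atom"

# Hypothetical helper: return the digits that follow "CIK=" in a single
# href string, or NA if the pattern is not present.
extract_cik <- function(href) {
  # perl = TRUE enables the lookbehind (?<=CIK=), so only the digits match
  m <- regmatches(href, regexpr("(?<=CIK=)[0-9]+", href, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

cik <- extract_cik(href)
print(cik)
```

The result is kept as a character string rather than converted to a number, since as.numeric() would drop the leading zeros.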