Sorry, I can't reproduce that error when running those commands in R on
64-bit Linux. But on Windows (R version 3.0.1, XML_3.98-1.1), I get a
different error ...

> require(XML)
Loading required package: XML
> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
> node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
Input is not proper UTF-8, indicate encoding !
Bytes: 0xC2 0x0A 0x20 0x20
Error: 1: Input is not proper UTF-8, indicate encoding !
Bytes: 0xC2 0x0A 0x20 0x20
> node <- getNodeSet(doc, "//link[@rel='alternate']" )
Error in UseMethod("xpathApply") :
  no applicable method for 'xpathApply' applied to an object of class
"XMLDocumentContent"

... note that I've tried both doc[[1]] and doc in the function call. Also,
only the XML package is required. I'm not sure what's behind the
character encoding error; it might be my system settings. The help
page (?htmlTreeParse) suggests using the htmlParse function instead,
which is equivalent to setting the useInternalNodes parameter to TRUE ...
"These can then be searched using XPath expressions via 'xpathApply' and
'getNodeSet'." That seems to be relevant to this case.
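
As a rough offline illustration of that help-page note, the sketch below parses a small, made-up HTML string (a hypothetical stand-in for the EDGAR page, not its real source) both ways. It assumes only the XML package; the names html, tree_doc, int_doc and nodes are mine.

```r
library(XML)

# Hypothetical one-link page standing in for the EDGAR response.
html <- '<html><head><link rel="alternate" title="ATOM" href="/feed"/></head><body></body></html>'

# Default htmlTreeParse(): an R-level tree, which is what produced the
# "no applicable method for 'xpathApply'" error above.
tree_doc <- htmlTreeParse(html, asText = TRUE)
class(tree_doc)

# With useInternalNodes = TRUE (what htmlParse() does by default) the result
# is an internal document that getNodeSet() and xpathApply() can search.
int_doc <- htmlTreeParse(html, asText = TRUE, useInternalNodes = TRUE)
nodes <- getNodeSet(int_doc, "//link[@rel='alternate']")
length(nodes)
```

Comparing class(tree_doc) with class(int_doc) should show which of the two objects the XPath functions accept.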

> doc <- htmlParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
> node <- xpathSApply(doc, "//link[@rel='alternate']", xmlAttrs)
> node
      [,1]
rel   "alternate"
type  "application/atom+xml"
title "ATOM"
href  "/cgi-bin/browse-edgar?action=getcompany&CIK=0000789019&type=&dateb=&owner=exclude&count=40&output=atom"
> strsplit(strsplit(node[[4]], "CIK=")[[1]][2], "&type")[[1]][1]
[1] "0000789019"

Perhaps that approach is less prone to error.
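
If the positional index node[[4]] and the nested strsplit() call feel fragile, one possible variation is to request the href attribute by name and pull the digits out with a regular expression. This is only a sketch: it runs on an inline HTML string that I built around the href printed above (the surrounding markup is an assumption, not the real page), and the names html, href and cik are mine.

```r
library(XML)

# Hypothetical stand-in for the page head; the href value is the one printed above.
html <- paste0(
  '<html><head><link rel="alternate" type="application/atom+xml" title="ATOM" ',
  'href="/cgi-bin/browse-edgar?action=getcompany&amp;CIK=0000789019&amp;type=',
  '&amp;dateb=&amp;owner=exclude&amp;count=40&amp;output=atom"/></head>',
  '<body></body></html>'
)

doc <- htmlParse(html, asText = TRUE)

# Ask for the href attribute by name rather than by its position in the matrix.
href <- xpathSApply(doc, "//link[@rel='alternate']", xmlGetAttr, "href")

# Keep only the run of digits that follows "CIK=" in the query string.
cik <- sub(".*[?&]CIK=([0-9]+).*", "\\1", href[1])
cik
```

Against the live page, the htmlParse() call with the URL shown earlier would replace the inline string; the rest should carry over as long as the link element keeps the same attributes.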


On Thu, Aug 15, 2013 at 12:48 PM, Sparks, John James <jspa...@uic.edu> wrote:

> Thanks so much for looking into this for me.
>
> Unfortunately, I get an error when I execute your code.  Is there a
> library that you loaded that I haven't?
>
> require(scrapeR)
> require(XML)
> require(RCurl)
> doc<-htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
> node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
> Error in UseMethod("xpathApply") :
>   no applicable method for 'xpathApply' applied to an object of class
> "character"
>
>
> Guidance would be much appreciated.
>
> --JJS
>
>
>
> On Wed, August 14, 2013 4:19 am, Jeffrey Dick wrote:
> > Hi,
> >
> > There are many occurrences of the CIK number in the page source. This
> > pulls out the first node containing it:
> >
> > node <- getNodeSet(doc[[1]], "//link[@rel='alternate']" )
> >
> > From there you can extract the number. Here's one way to do it.
> >
> > strsplit(strsplit(unlist(node)[[5]], "CIK=")[[1]][2], "&type")[[1]][1]
> >
> > Jeff
> >
> >
> > On Wed, Aug 14, 2013 at 1:34 PM, Sparks, John James <jspa...@uic.edu> wrote:
> >
> >> Dear R Helpers,
> >>
> >> I would like to pull the CIK number from the web page
> >>
> >> http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany
> >>
> >> If you put this web page into your browser you will see the CIK number
> >> in red on the left side of the page near the top.
> >>
> >> When I try the basic
> >> require(scrapeR)
> >> require(XML)
> >> require(RCurl)
> >> doc <- htmlTreeParse("http://www.sec.gov/cgi-bin/browse-edgar?CIK=MSFT&Find=Search&owner=exclude&action=getcompany")
> >> str(doc)
> >>
> >> I get a large number of items in the data frame that I don't know how to
> >> interpret.  Both
> >> tables <- readHTMLTable(doc)
> >>
> >> and
> >>
> >> list<-xmlToList(doc)
> >>
> >> result in errors.
> >>
> >> Any (positive) guidance would be much appreciated.
> >>
> >> --John J. Sparks, Ph.D.
> >>
> >> ______________________________________________
> >> R-help@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide
> >> http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
>
>
>
