Hi Tony --

Tony Breyal <[EMAIL PROTECTED]> writes:

> Dear R-help,
>
> I want to download the text from a web page, however what i end up
> with is the html code. Is there some option that i am missing in the
> RCurl package? Or is there another way to achieve this? This is the
> code i am using:
>
>> library(RCurl)
>>
>> html.file <- getURI(my.url, ssl.verifyhost = FALSE, ssl.verifypeer = FALSE,
>> followlocation = TRUE)
>> print(html.file)
>
> I thought perhaps the htmlTreeParse() function from the XML package
> might help, but I just don't know what to do next with it:
>
>> library(XML)
>> htmlTreeParse(html.file)
>
> Many thanks for any help you can provide,

Sounds like you're on the right track. One way is to parse the html
file into its 'internal' representation, and then use xpathApply to
extract relevant information (e.g., the third 'p' (paragraph) element
from the XML mark-up):

> html = htmlTreeParse(getURL(my.url), useInternal=TRUE)
Opening and ending tag mismatch: td and font
Unexpected end tag : p
Unexpected end tag : form
> xpathApply(html, "//p[3]", xmlValue)
[[1]]
[1] "You can subscribe to the list, or change your existing\r\n\t subscription, in the sections below.\r\n\t"

The 'xpath' is the path from the root of the document through various
nested tags to tags of the specified type. "//p" says 'start at the
root (the '/') and look in all sub-nodes (that is the '//') for a 'p'
tag'. The messages printed after htmlTreeParse are the parser
complaining about malformed html in the page; a result is still
returned. ?xpathApply is a good starting place, as is
http://www.w3.org/TR/xpath, especially
http://www.w3.org/TR/xpath#path-abbrev

Martin

> Tony Breyal
>
>
>> sessionInfo()
> R version 2.7.2 (2008-08-25)
> i386-pc-mingw32
>
> locale:
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] XML_1.94-0 RCurl_0.9-4
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
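
P.S. To make the xpathApply suggestion above concrete, here is a minimal, self-contained sketch that needs no network access. The html string and the variable names (html.text, doc, all.p, third.p, page.text) are made up for illustration; only htmlTreeParse, xpathSApply and xmlValue come from the XML package. It also assumes that "the text of the page" means the contents of the 'p' elements, which will not hold for every page.

```r
library(XML)

# Hypothetical stand-in for the html that getURL(my.url) would return
html.text <- paste0(
  "<html><body>",
  "<p>First paragraph.</p>",
  "<p>Second paragraph.</p>",
  "<p>Third <b>paragraph</b>.</p>",
  "</body></html>")

# Parse the string (asText = TRUE) into the 'internal' representation
doc <- htmlTreeParse(html.text, asText = TRUE, useInternalNodes = TRUE)

# xpathSApply is the simplifying variant of xpathApply: here it
# returns a character vector with the text of every 'p' element
all.p <- xpathSApply(doc, "//p", xmlValue)

# "//p[3]" selects each 'p' that is the third 'p' child of its parent;
# xmlValue drops the nested <b> mark-up and keeps only the text
third.p <- xpathSApply(doc, "//p[3]", xmlValue)

# One plain-text string for the whole page, one paragraph per line
page.text <- paste(all.p, collapse = "\n")
cat(page.text, "\n")
```

Because "//p[3]" is evaluated relative to each parent node, a real page with several containers may return more than one match, so it is worth checking the length of the result before relying on it.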