Staying in R, the XML package in conjunction with the XPath query language is likely to be your friend.
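As a minimal, self-contained sketch of that approach, the snippet below runs an XPath query against an inline HTML string instead of a live page. The markup is made up for illustration and is not taken from the wunderground page: the span element, the pwsvariable attribute, and the humidity entry are all assumptions, and htmlParse(..., asText = TRUE) stands in for fetching a URL.

```r
library(XML)

# Hypothetical page fragment. The element name (span), the attribute
# holding the variable name (pwsvariable), and the humidity entry are
# assumptions for illustration only, not the real page markup.
page <- '<html><body>
  <span pwsvariable="tempf" pwsid="LIRA" value="63"></span>
  <span pwsvariable="humidity" pwsid="LIRA" value="71"></span>
</body></html>'

# Parse the string into an internal document so XPath queries can be run.
doc <- htmlParse(page, asText = TRUE)

# Select the value attribute of the node where both predicates match.
temp <- xpathSApply(doc, "//span[@pwsvariable='tempf' and @pwsid='LIRA']/@value")

# Attribute results come back as character data; drop names and convert.
temp_f <- as.numeric(unname(temp))

# Convert Fahrenheit to Celsius, since the question asks for degrees C.
temp_c <- (temp_f - 32) * 5 / 9
print(round(temp_c, 1))
```

In this made-up fragment, swapping 'tempf' for 'humidity' in the predicate would select the other value; on a real page the predicates would have to match whatever attributes the page actually uses.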
> library(XML)
> html=htmlTreeParse("http://www.wunderground.com/global/stations/16239.html",
+                    useInternal=TRUE)
> xpathApply(html, "//[EMAIL PROTECTED]'tempf' and
+ @pwsid='LIRA']/@value", xmlValue)
[[1]]
[1] "63"

See http://www.w3.org/TR/xpath, especially
http://www.w3.org/TR/xpath#path-abbrev, for XPath hints.

Martin

Daniel Folkinshteyn <[EMAIL PROTECTED]> writes:

> i know this is an R mailing list :) but... i'll recommend you try
> python with the beautifulsoup module - makes html processing a cinch.
>
> another thing to note is that wunderground provides very handy RSS
> feeds for every location, so rather than parsing the html page (with
> its associated bundles of gunk), you'd have a better time parsing the
> RSS feed. (there are some rss parsing libraries for python, too, but
> in your simple case it may be simpler to just extract stuff manually
> with some well-placed regexps)
>
> so use python to pull that out, and append to a nice tab-delimited
> file, and then in your R process just read from that file.
>
> on 06/05/2008 04:45 PM Nutter, Benjamin said the following:
>> I've tried to tackle a similar question at the request of a coworker.
>> Unfortunately, it is difficult to read in HTML code because it lacks
>> a character that can consistently be used as a delimiter. The only
>> guideline I can offer is that any text you're interested in is going to
>> be between a ">" and a "<". So the goal is to eliminate anything
>> between < and >.
>>
>> What's more, if you really want to read in HTML code, you'll need a
>> good grasp on HTML itself, and some familiarity with how the code you're
>> reading in is structured. For instance, I'm attaching code that I wrote
>> to read in HTML tables that were generated by other functions commonly
>> used in my workplace. But my code assumes that the tables are written
>> by row (using the <tr> tag).
>>
>> Essentially, after studying the code I was going to read in, I hand
>> picked the markers that I could use to isolate the text I wanted. I
>> then proceeded to play a game of Simon Says to break down the code to
>> smaller and smaller pieces until I got what I wanted. Unless you're
>> going to be doing this a lot, I wouldn't recommend taking
>> the time to try and write a function like this. In most cases it's
>> probably faster just to copy the data by hand. But if you are
>> determined to make it work, I hope the ideas help.
>>
>> Benjamin
>>
>> -----Original Message-----
>> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
>> On Behalf Of vittorio
>> Sent: Wednesday, June 04, 2008 3:50 PM
>> To: [EMAIL PROTECTED]
>> Subject: [Possible SPAM] [R] Reading selected lines in an .html file
>>
>> Dear friend, In an R program running permanently on a server I would
>> like to read hour by hour the temperature in °C and the humidity from
>> a site like this (actually, from many of such sites):
>> http://www.wunderground.com/global/stations/16239.html
>> How can I read the content of the site and select the info I need?
>>
>> Ciao
>> Vittorio
>>
>> ______________________________________________
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793