On Sun, 2011-03-06 at 08:06 -0500, Mike Marchywka wrote:
>
> ----------------------------------------
> > Date: Thu, 3 Mar 2011 13:04:11 -0600
> > From: matt.shotw...@vanderbilt.edu
> > To: r-help@r-project.org
> > Subject: Re: [R] Developing a web crawler / R "webkit" or something
> > similar? [off topic]
> >
> > On 03/03/2011 08:07 AM, Mike Marchywka wrote:
> > >
> > >> Date: Thu, 3 Mar 2011 01:22:44 -0800
> > >> From: antuj...@gmail.com
> > >> To: r-help@r-project.org
> > >> Subject: [R] Developing a web crawler
> > >>
> > >> Hi,
> > >>
> > >> I wish to develop a web crawler in R. I have been using the
> > >> functionalities available under the RCurl package.
> > >> I am able to extract the html content of the site but I don't know how
> > >> to go
> > >
> > > In general this can be a big effort but there may be things in
> > > text processing packages you could adapt to execute html and javascript.
> > > However, I guess what I'd be looking for is something like a "webkit"
> > > package or other open source browser with or without an "R" interface.
> > > This actually may be an ideal solution for a lot of things as you get
> > > all the content handlers of at least some browser.
> > >
> > > Now that you mention it, I wonder if there are browser plugins to handle
> > > "R" content ( I'd have to give this some thought, put a script up as
> > > a web page with mime type "test/R" and have it execute it in R. )
> >
> > There are server-side solutions for this sort of thing. See
> > http://rapache.net/ . Also, there was a string of messages on R-devel
> > some years ago addressing the mime type issue; beginning here:
> > http://tolstoy.newcastle.edu.au/R/devel/05/11/3054.html . Though I don't
> > know whether there was a resolution. Some suggestions were text/x-R,
> > text/x-Rd, application/x-RData.
> >
>
> The rapache demo looks like something I could use right away
> but I haven't looked into the handlers yet. I have installed rapache now
> on my Debian system ( still have config issues but I did get apache2 to
> restart LOL)
>
> Before I plow into this too far, how would this compare/compete with
> something like a PHP library for Rserve? That is the approach I had been
> pursuing.
>
> Thanks.

Hi Mike,

If you've built and configured RApache, then the difficult "plowing" is
over :). RApache works at the HTTP level, the application layer at the
top of the OSI stack, whereas Rserve speaks its own protocol directly
over TCP/IP sockets. Hence, the scope of Rserve applications is far more
general. Extending Rserve to operate at the HTTP layer (via PHP) will
mean more work, since the PHP code has to translate between HTTP
requests and Rserve calls.

RApache offers high-level functionality, for example, to replace PHP
with R in web pages. No interface code is necessary. Here's a simple
"What's The Time?" webpage using RApache and yarr [1] to handle the
code:

<< setContentType("text/html\n\n") >>
<html>
<head><title>What's The Time?</title></head>
<body><pre><</= cat(format(Sys.time(), usetz=TRUE)) >></pre></body>
</html>

Here's a live version: [2].

Interfacing PHP with Rserve in this context would be useful if
installation of R and/or RApache on the web host were prohibited. A
PHP/Rserve framework might also be useful in other contexts, for
example, to extend PHP applications (e.g. WordPress, MediaWiki).

Best,
Matt

[1] http://biostatmatt.com/archives/1000
[2] http://biostatmatt.com/yarr/time.yarr

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
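
For anyone who wants to try the "What's The Time?" page from the message above without RApache installed, here is a minimal sketch in plain R of the HTML that template is meant to produce. It does not use the RApache or yarr API at all; `render_time_page` is a made-up name for illustration, and the only piece taken from the message is the `format(Sys.time(), usetz=TRUE)` expression and the surrounding markup.

```r
# Hypothetical helper (not part of RApache or yarr): builds the same HTML
# that the "What's The Time?" template is meant to produce.
render_time_page <- function(now = Sys.time()) {
  # Same expression the template evaluates inside <pre>...</pre>.
  stamp <- format(now, usetz = TRUE)
  paste0(
    "<html>\n",
    "<head><title>What's The Time?</title></head>\n",
    "<body><pre>", stamp, "</pre></body>\n",
    "</html>\n"
  )
}

# Print the page to the console; under RApache the output would go to the
# HTTP response instead.
cat(render_time_page())
```

Taking the time as an argument (defaulting to `Sys.time()`) is only a convenience so the function can be checked with a fixed timestamp; the template in the message always uses the current time.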