Cheers Duncan, that worked great:

> getURL("http://uk.youtube.com", httpheader = c("User-Agent" = "R (2.8.1)"))
[1] "<!DOCTYPE HTML PUBLIC \"-//W3C//DTD HTML 4.01 Transitional//EN\" \"http://www.w3.org/TR/1999/REC-html401-19991224/loose.dtd\">\n\n\ [etc]

May I ask if there was a specific manual you read to learn these things, please? I do not think I could have worked that one out on my own.

Thank you again for your time,
C.C
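For reference, the header fix above and the followlocation fix from earlier in the thread, combined into a single call. A sketch only: the User-Agent string is arbitrary, and followlocation matters only for pages that redirect, such as the nytimes.com article discussed further down.

    library(RCurl)

    ## Some servers answer "400 Bad Request" when no User-Agent header is
    ## sent; httpheader supplies one. followlocation = TRUE additionally
    ## follows 301/302 redirects (the nytimes.com case below).
    page <- getURL("http://uk.youtube.com",
                   httpheader = c("User-Agent" = "R (2.8.1)"),
                   followlocation = TRUE)
    nchar(page)   # non-zero length means the page body actually came back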
On 27 Jan, 16:46, Duncan Temple Lang <dun...@wald.ucdavis.edu> wrote:
> Some Web servers are strict. In this case, it won't accept
> a request without being told who is asking, i.e. the User-Agent.
>
> If you use
>
>   getURL("http://www.youtube.com",
>          httpheader = c("User-Agent" = "R (2.9.0)"))
>
> you should get the contents of the page as expected.
>
> (Or with the URL uk.youtube.com, etc.)
>
> D.
>
> clair.crossup...@googlemail.com wrote:
> > Thank you. The output I get from that example is below:
> >
> >> d = debugGatherer()
> >> getURL("http://uk.youtube.com",
> > +    debugfunction = d$update, verbose = TRUE)
> > [1] ""
> >> d$value()
> > text
> > "About to connect() to uk.youtube.com port 80 (#0)\n  Trying
> > 208.117.236.72... connected\nConnected to uk.youtube.com
> > (208.117.236.72) port 80 (#0)\nConnection #0 to host uk.youtube.com
> > left intact\n"
> > headerIn
> > "HTTP/1.1 400 Bad Request\r\nVia: 1.1 PFO-FIREWALL\r\nConnection: Keep-
> > Alive\r\nProxy-Connection: Keep-Alive\r\nTransfer-Encoding: chunked\r
> > \nExpires: Tue, 27 Apr 1971 19:44:06 EST\r\nDate: Tue, 27 Jan 2009
> > 15:31:25 GMT\r\nContent-Type: text/plain\r\nServer: Apache\r\nX-
> > Content-Type-Options: nosniff\r\nCache-Control: no-cache\r
> > \nCneonction: close\r\n\r\n"
> > headerOut
> > "GET / HTTP/1.1\r\nHost: uk.youtube.com\r\nAccept: */*\r\n\r\n"
> > dataIn
> > "0\r\n\r\n"
> > dataOut
> > ""
> >
> > So the critical information from this is the '400 Bad Request'. A
> > Google search defines this for me as:
> >
> >   The request could not be understood by the server due to malformed
> >   syntax. The client SHOULD NOT repeat the request without
> >   modifications.
> >
> > Looking through both sort(listCurlOptions()) and
> > http://curl.haxx.se/libcurl/c/curl_easy_setopt.htm doesn't really
> > help me this time (unless I missed something). Any advice?
> >
> > Thank you for your time,
> > C.C
> >
> > P.S. I can get the download to work if I use:
> >> toString(readLines("http://www.uk.youtube.com"))
> > [1] "<html>, \t<head>, \t\t<title>OpenDNS</title>, \t</head>, ,
> > \t<body id=\"mainbody\" onLoad=\"testforbanner();\" style=\"margin:
> > 0px;\">, \t\t<script language=\"JavaScript\">, \t\t\tfunction
> > testforbanner() {, \t\t\t\tvar width;, \t\t\t\tvar height;, \t\t\t
> > \tvar x = 0;, \t\t\t\tvar isbanner = false;, \t\t\t\tvar bannersizes =
> > new Array(16), \t\t\t\tbannersizes[0] = [etc]
> >
> > On 27 Jan, 13:52, Duncan Temple Lang <dun...@wald.ucdavis.edu> wrote:
> >> clair.crossup...@googlemail.com wrote:
> >>> Thank you Duncan.
> >>> I remember seeing in your documentation that you have used this
> >>> 'verbose=TRUE' argument in functions before when trying to see what
> >>> is going on. This is good. However, I have not been able to get it
> >>> to work for me. Does the output appear in R, or do you use some
> >>> other external window (i.e. an MS-DOS window)?
> >> The libcurl code typically defaults to printing on the console.
> >> So in the Windows GUI, this will not show up. Using
> >> a shell (MS-DOS window or Unix-like shell)
> >> should cause the output to be displayed.
> >>
> >> A more general way, however, is to use the debugfunction
> >> option.
> >>
> >>   d = debugGatherer()
> >>   getURL("http://uk.youtube.com",
> >>          debugfunction = d$update, verbose = TRUE)
> >>
> >> When this completes, use
> >>
> >>   d$value()
> >>
> >> and you have the entire contents that would be displayed on the console.
> >>
> >> D.
> >>
> >>>> library(RCurl)
> >>>> my.url <-
> >>>> 'http://www.nytimes.com/2009/01/07/technology/business-computing/07pro...'
> >>>> getURL(my.url, verbose = TRUE)
> >>> [1] ""
> >>>
> >>> I am having a problem with a new webpage (http://uk.youtube.com/) but
> >>> if I can get this verbose to work, then I think I will be able to
> >>> google the right action to take based on the information it gives.
> >>> Many thanks for your time,
> >>> C.C.
> >>>
> >>> On 26 Jan, 16:12, Duncan Temple Lang <dun...@wald.ucdavis.edu> wrote:
> >>>> clair.crossup...@googlemail.com wrote:
> >>>>> Dear R-help,
> >>>>> There seems to be a web page I am unable to download using RCurl. I
> >>>>> don't understand why it won't download:
> >>>>>> library(RCurl)
> >>>>>> my.url <-
> >>>>>> "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
> >>>>>> getURL(my.url)
> >>>>> [1] ""
> >>>> I like the irony that RCurl seems to have difficulties downloading an
> >>>> article about R. Good thing it is just a matter of additional
> >>>> arguments to getURL(), or it would be bad news.
> >>>> The followlocation parameter defaults to FALSE, so
> >>>>   getURL(my.url, followlocation = TRUE)
> >>>> gets what you want.
> >>>> The way I found this is
> >>>>   getURL(my.url, verbose = TRUE)
> >>>> and taking a look at the information being sent from R
> >>>> and received by R from the server.
> >>>> This gives
> >>>> * About to connect() to www.nytimes.com port 80 (#0)
> >>>> *   Trying 199.239.136.200... * connected
> >>>> * Connected to www.nytimes.com (199.239.136.200) port 80 (#0)
> >>>> > GET /2009/01/07/technology/business-computing/07program.html?_r=2 HTTP/1.1
> >>>> Host: www.nytimes.com
> >>>> Accept: */*
> >>>> < HTTP/1.1 301 Moved Permanently
> >>>> < Server: Sun-ONE-Web-Server/6.1
> >>>> < Date: Mon, 26 Jan 2009 16:10:51 GMT
> >>>> < Content-length: 0
> >>>> < Content-type: text/html
> >>>> < Location: http://www.nytimes.com/glogin?URI=http://www.nytimes.com/2009/01/07/t...
> >>>> <
> >>>> And the 301 is the critical thing here.
> >>>> D.
> >>>>> Other web pages are ok to download, but this is the first time I
> >>>>> have been unable to download a web page using the very nice RCurl
> >>>>> package. While I can download the webpage using RDCOMClient, I
> >>>>> would like to understand why it doesn't work as above, please?
> >>>>>> library(RDCOMClient)
> >>>>>> my.url <-
> >>>>>> "http://www.nytimes.com/2009/01/07/technology/business-computing/07pro..."
> >>>>>> ie <- COMCreate("InternetExplorer.Application")
> >>>>>> txt <- list()
> >>>>>> ie$Navigate(my.url)
> >>>>> NULL
> >>>>>> while(ie[["Busy"]]) Sys.sleep(1)
> >>>>>> txt[[my.url]] <- ie[["document"]][["body"]][["innerText"]]
> >>>>>> txt
> >>>>> $`http://www.nytimes.com/2009/01/07/technology/business-computing/
> >>>>> 07program.html?_r=2`
> >>>>> [1] "Skip to article Try Electronic Edition Log ...
> >>>>> Many thanks for your time,
> >>>>> C.C
> >>>>> Windows Vista, running with administrator privileges.
> >>>>>> sessionInfo()
> >>>>> R version 2.8.1 (2008-12-22)
> >>>>> i386-pc-mingw32
> >>>>> locale:
> >>>>> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
> >>>>> attached base packages:
> >>>>> [1] stats     graphics  grDevices utils     datasets  methods   base
> >>>>> other attached packages:
> >>>>> [1] RDCOMClient_0.92-0 RCurl_0.94-0
> >>>>> loaded via a namespace (and not attached):
> >>>>> [1] tools_2.8.1
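For anyone hitting this thread later, the diagnostic pattern Duncan describes, collected in one place. A minimal sketch, assuming RCurl is installed; the URL is just the example from the thread:

    library(RCurl)

    ## Capture the full libcurl conversation (request headers, response
    ## headers, data in/out) instead of relying on console output, which
    ## the Windows R GUI does not display.
    d <- debugGatherer()
    getURL("http://uk.youtube.com",
           debugfunction = d$update,
           verbose = TRUE)

    ## The status line is in the incoming headers, e.g. "400 Bad Request"
    ## (send a User-Agent) or "301 Moved Permanently" (set followlocation).
    cat(d$value()[["headerIn"]])

    ## listCurlOptions() lists every libcurl option getURL() accepts,
    ## e.g. httpheader, followlocation, useragent.
    sort(listCurlOptions())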