Hello all, I changed the subject line of the e-mail because the question I'm posing now is different from the first one. I hope that this is proper etiquette. The original chain is still included below.
I've incorporated bits of both Ethan's and Brian's code into the script below, but there's one aspect I can't get my head around. I'm totally new to programming with control structures. The reproducible code below creates a list of 19 data frames, one for each year of the "Most Important Problem" survey data for Canada.

What I'd like at this stage is a loop that searches through all the data frames for rows containing a search term and then binds those rows together in a plottable format. At the bottom of the code below, you'll find my first attempt to use a search string and put the result into a plottable format. It only partially works: I can only get the numbers for one year, whereas I'd like to get a string of numbers for several years. On the upside, grep appears to do the trick in terms of selecting rows. Can anyone suggest a solution?
Yours truly, Simon Kiss

#This is the reproducible code to set up all the data frames
library(XML)

#This gets the data from the web and lists them
mylist <- paste("http://www.queensu.ca/cora/_trends/mip_", c(1987:2001,2003:2006), ".htm", sep="")
alltables <- lapply(mylist, readHTMLTable)

#convert to dataframes
r<-lapply(alltables, function(x) {as.data.frame(x)} )

#This is just some house-cleaning; structuring all the tables so they are uniform
r[[1]][3]<-r[[1]][2]
r[[1]][2]<-c(" ")
r[[2]][4]<-r[[2]][2]
r[[2]][5]<-r[[2]][3]
r[[2]][2:3]<-c(" ")
r[[3]][4:5]<-r[[3]][3:4]
r[[3]][3]<-c(" ")

#This loop deletes some superfluous columns and rows, turns the first column into character strings and the data into numeric
for (i in 1:19) {
  n.rows<-dim(r[[i]])[1]
  #Note: 15:n.rows-3 is evaluated as (15:n.rows)-3, i.e. rows 12 to n.rows-3
  r[[i]] <- r[[i]][15:n.rows-3, 1:5]
  n.rows<-dim(r[[i]])[1]
  row.names(r[[i]]) <-NULL
  names(r[[i]]) <- c("Response", "Q1", "Q2", "Q3", "Q4")
  r[[i]][, 1]<-as.character(r[[i]][,1])
  #r[[i]][,2:5]<-as.numeric(as.character(r[[i]][,2:5]))
  r[[i]][, 2:5]<-lapply(r[[i]][, 2:5], function(x) {as.numeric(as.character(x))})
  #n.rows<-dim(r[[i]])[1]
  #r[[i]]<-r[[i]][9
}

#This code is my first attempt at introducing a search string, getting the rows, binding and plotting
economy<-r[[10]][grep('Economy', r[[10]][,1]),]
economy_2<-r[[11]][grep('Economy', r[[11]][,1]),]
test<-cbind(economy, economy_2)
plot(as.numeric(test), type='l')

#here's another attempt I'm trying....
economy<-data.frame
for (i in 15:19) {
  economy[i,] <-r[[i]][grep('Economy', r[[i]][,1]), ]
}

Begin forwarded message:

> From: Simon Kiss <sjk...@gmail.com>
> Date: October 7, 2010 4:59:46 PM EDT
> To: Simon Kiss <simonjk...@yahoo.ca>
> Subject: Fwd: [R] Converting scraped data
>
> Begin forwarded message:
>
>> From: Ethan Brown <ethancbr...@gmail.com>
>> Date: October 6, 2010 4:22:41 PM GMT-04:00
>> To: Simon Kiss <sjk...@gmail.com>
>> Cc: r-help@r-project.org
>> Subject: Re: [R] Converting scraped data
>>
>> Hi Simon,
>>
>> You'll notice the "test" data.frame has a whole mix of characters in
>> the columns you're interested in, including a "-" for missing values,
>> and that the columns you're interested in are in fact factors.
>>
>> as.numeric(factor) returns the internal integer codes of the factor,
>> not the values of the levels. (See ?levels and ?factor)--that's why
>> it's giving you those irrelevant integers.
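A minimal sketch of the factor behaviour described above, using a made-up vector (the name `f` and its values are hypothetical, not taken from the scraped tables). It only illustrates why direct coercion gives unexpected integers and how going through the level labels avoids that.

```r
# Hypothetical factor resembling a scraped column, with "-" marking a missing value
f <- factor(c("35", "12", "-", "35"))

# Direct coercion returns the internal integer level codes, not the printed values
codes <- as.numeric(f)

# Converting to character first recovers the values; "-" becomes NA
# (the coercion warning is suppressed here because the NA is expected)
values <- suppressWarnings(as.numeric(as.character(f)))

# Same result by indexing the levels with the factor itself
values2 <- suppressWarnings(as.numeric(levels(f)[f]))

print(codes)
print(values)
```

Indexing the levels with the factor, as in `levels(f)[f]`, is the same idea as the unfactor() snippet quoted in this thread.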
>> I always end up using something like this handy
>> code snippet to deal with the situation:
>>
>> unfactor <- function(factors)
>> # From http://psychlab2.ucr.edu/rwiki/index.php/R_Code_Snippets#unfactor
>> # Transform a factor back into its factor names
>> {
>>   return(levels(factors)[factors])
>> }
>>
>> Then, to get your data to where you want it, I'd do this:
>>
>> require(XML)
>> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
>> tables <- readHTMLTable(theurl)
>> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
>> class(tables)
>> test<-data.frame(tables, stringsAsFactors=FALSE)
>>
>> result <- test[11:42, 1:5] #Extract the actual data we want
>> names(result) <- c("Response", "Q1", "Q2","Q3","Q4")
>> for(i in 2:5) {
>>   # Convert factor columns to numeric
>>   result[,i] <- as.numeric(unfactor(result[,i]))
>> }
>> result
>>
>> From here you should be able to plot or do whatever else you want.
>>
>> Hope this helps,
>> Ethan Brown
>>
>> On Wed, Oct 6, 2010 at 9:52 AM, Simon Kiss <sjk...@gmail.com> wrote:
>>> Dear Colleagues,
>>> I used this code to scrape data from the URL contained within. This code
>>> should be reproducible.
>>>
>>> require("XML")
>>> library(XML)
>>> theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
>>> tables <- readHTMLTable(theurl)
>>> n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
>>> class(tables)
>>> test<-data.frame(tables, stringsAsFactors=FALSE)
>>> test[16,c(2:5)]
>>> as.numeric(test[16,c(2:5)])
>>> quartz()
>>> plot(c(1:4), test[15, c(2:5)])
>>>
>>> Calling the values from the row of interest using test[16, c(2:5)] brings
>>> them up as represented on the screen, but plotting them or coercing them
>>> to numeric changes the values in a way that doesn't make sense to me. My
>>> intuition is that there is something going on with the way the characters
>>> are coded or classed when they're scraped into R.
>>> I've looked around the help files for converting from character to
>>> numeric but can't find a solution.
>>>
>>> I also tried this:
>>>
>>> as.numeric(as.character(test[16,c(2:5)]))
>>>
>>> and that also changed the values from what they originally were.
>>>
>>> I'm grateful for any suggestions.
>>> Yours, Simon Kiss
>>>
>>> *********************************
>>> Simon J. Kiss, PhD
>>> Assistant Professor, Wilfrid Laurier University
>>> 73 George Street
>>> Brantford, Ontario, Canada
>>> N3T 2C9
>>> Cell: +1 519 761 7606
>>>
>>> ______________________________________________
>>> R-help@r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
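The loop asked for at the top of this message (search every data frame in a list for rows matching a term, then bind the matches into one plottable object) could be sketched roughly as follows. This is only an illustration under stated assumptions: `r_toy`, `years` and `find_rows` are hypothetical names, and the numbers are made up so the example runs without the web scrape. With the real list, `r` would replace `r_toy` and `c(1987:2001, 2003:2006)` would replace `years`.

```r
# Toy stand-in for the list `r`: two small data frames with the same columns
years <- c(2005, 2006)
r_toy <- list(
  data.frame(Response = c("Economy", "Health care"),
             Q1 = c(10, 30), Q2 = c(11, 31), Q3 = c(12, 32), Q4 = c(13, 33),
             stringsAsFactors = FALSE),
  data.frame(Response = c("Health care", "Economy"),
             Q1 = c(34, 14), Q2 = c(35, 15), Q3 = c(36, 16), Q4 = c(37, 17),
             stringsAsFactors = FALSE)
)

# Keep the rows whose Response matches the search term and tag them with the year
find_rows <- function(df, year, term) {
  hits <- df[grep(term, df$Response), , drop = FALSE]
  hits$Year <- rep(year, nrow(hits))
  hits
}

# Apply the search to every data frame in the list, pairing each with its year
matches <- Map(find_rows, r_toy, years, MoreArgs = list(term = "Economy"))

# Stack the per-year results into one data frame
economy <- do.call(rbind, matches)

# Flatten the quarterly columns row by row into one series, in year order
series <- as.vector(t(as.matrix(economy[, c("Q1", "Q2", "Q3", "Q4")])))

plot(series, type = "l", xlab = "Quarter (in sequence)", ylab = "Value")
```

Building a list of matches and binding once with do.call(rbind, ...) avoids growing a data frame inside a for loop; if a search term matches more than one row per year, the flattened series would need to be split by Response before plotting.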