Dear Rasmus et al.:
On 2020-07-25 04:10, Rasmus Liland wrote:
> On 2020-07-24 10:28 -0500, Spencer Graves wrote:
>> Dear Rasmus:
>>
>>> Dear Spencer,
>>>
>>> I unified the party tables after the
>>> first summary table like this:
>>>
>>> url <-
>>>   "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> M_sos <- RCurl::getURL(url)
>>> saveRDS(object=M_sos, file="dcp.rds")
>>> dat <- XML::readHTMLTable(M_sos)
>>> idx <- 2:length(dat)
>>> cn <- unique(unlist(lapply(dat[idx], colnames)))
>> This is useful for this application.
>>
>>> dat <- do.call(rbind,
>>>   sapply(idx, function(i, dat, cn) {
>>>     x <- dat[[i]]
>>>     x[,cn[!(cn %in% colnames(x))]] <- NA
>>>     x <- x[,cn]
>>>     x$Party <- names(dat)[i]
>>>     return(list(x))
>>>   }, dat=dat, cn=cn))
>>> dat[,"Date Filed"] <-
>>>   as.Date(x=dat[,"Date Filed"],
>>>           format="%m/%d/%Y")
>> This misses something extremely
>> important for this application: The
>> political office. That's buried in
>> the HTML or whatever it is. I'm using
>> something like the following to find
>> that:
>>
>> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
> Dear Spencer,
>
> I came up with a solution, but it is not
> very elegant. Instead of showing you
> the solution, hoping you understand
> everything in it, I instead want to give
> you some emphatic hints to see if you
> can come up with a solution on your own.
>
> - XML::htmlTreeParse(M_sos)
> - *Gandalf voice*: climb the tree
>   until you find the content you are
>   looking for flat out at the level of
>   "The Children of the Div", *uuuUUU*
> - you only want to keep the table and
>   header tags at this level
> - Use XML::xmlValue to extract the
>   values of all the headers (the
>   political positions)
> - Observe that all the tables on the
>   page you were able to extract
>   previously using XML::readHTMLTable
>   are at this level, shuffled between
>   the political position header tags.
>   This means you extract the political
>   position and party affiliation by
>   using a for loop, if statements,
>   typeof, names, and [] and [[]] to grab
>   different things from the list
>   (content or the bag itself).
> XML::readHTMLTable strips away the
> line break tags from the Mailing
> address, so if you find a better way
> of extracting the tables, tell me,
> e.g. you get
>
>   8805 HUNTER AVEKANSAS CITY MO 64138
>
> and not
>
>   8805 HUNTER AVE<br/>KANSAS CITY MO 64138
>
> When you've completed this "programming
> quest", you're back at the level of the
> previous email, i.e. you have the
> same tables, but with political position
> and party affiliation added to them.

Please excuse: Before my last post, I had written code to do all that. In brief, the political offices are "h3" tags. I used "strsplit" to split the string at "<h3>". I then wrote a function to find "</h3>", extract the political office and pass the rest to "XML::readHTMLTable", adding columns for party and political office.

However, this suppressed "<br/>" everywhere. I thought there should be an option with something like "XML::readHTMLTable" that would not delete "<br/>" everywhere, but I couldn't find it. If you aren't aware of one, I can gsub("<br/>", "\n", ...) on the string for each political office before passing it to "XML::readHTMLTable". I just tested this: It works.
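
The steps just described might be sketched roughly as follows. This is only an illustration on a made-up HTML fragment, not the code actually run against the Secretary of State page: the object "html", the helper name "parse_office", the candidate names and the table ids are all assumptions, and the real page may be structured differently. The readHTMLTable step is guarded because it needs the XML package.

```r
# Hypothetical stand-in for M_sos: two offices, one party table each.
html <- paste0(
  "<div>",
  "<h3>Governor</h3>",
  "<table id='Republican'><tr><th>Name</th><th>Mailing Address</th></tr>",
  "<tr><td>CANDIDATE A</td><td>1 MAIN ST<br/>JEFFERSON CITY MO 65101</td></tr></table>",
  "<h3>Lieutenant Governor</h3>",
  "<table id='Democratic'><tr><th>Name</th><th>Mailing Address</th></tr>",
  "<tr><td>CANDIDATE B</td><td>8805 HUNTER AVE<br/>KANSAS CITY MO 64138</td></tr></table>",
  "</div>")

# Split at the opening tag; drop the piece that precedes the first office.
pieces <- strsplit(html, "<h3>", fixed = TRUE)[[1]][-1]

# parse_office() is a hypothetical helper: it separates the office name
# from the HTML that follows it and turns <br/> into a newline so the
# line break survives table extraction.
parse_office <- function(piece) {
  end <- as.integer(regexpr("</h3>", piece, fixed = TRUE))
  office <- substr(piece, 1, end - 1)
  rest <- substr(piece, end + nchar("</h3>"), nchar(piece))
  rest <- gsub("<br/>", "\n", rest, fixed = TRUE)
  list(office = office, html = rest)
}

parsed <- lapply(pieces, parse_office)
offices <- vapply(parsed, function(p) p$office, character(1))
print(offices)  # "Governor" "Lieutenant Governor"

# Optional: read the tables and tag them with office and party.
# Assumes, as in the quoted code, that the list names identify the party.
if (requireNamespace("XML", quietly = TRUE)) {
  tagged <- lapply(parsed, function(p) {
    tabs <- XML::readHTMLTable(p$html)
    for (i in seq_along(tabs)) {
      tabs[[i]]$Party <- names(tabs)[i]
      tabs[[i]]$Office <- p$office
    }
    tabs
  })
  str(tagged)
}
```

Replacing "<br/>" before parsing, rather than after, is the point of the sketch: once the table has been converted to a data frame the tag is already gone and the two address lines cannot be told apart.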

I have other web scraping problems in my work plan for the next few days. I will definitely try XML::htmlTreeParse, etc., as you suggest.

Thanks again.
Spencer Graves

>
> Best,
> Rasmus
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

	[[alternative HTML version deleted]]