Dear Rasmus:
On 2020-07-24 09:16, Rasmus Liland wrote:
> On 2020-07-24 08:20 -0500, luke-tier...@uiowa.edu wrote:
>> On Fri, 24 Jul 2020, Spencer Graves wrote:
>>> On 2020-07-23 17:46, William Michels wrote:
>>>> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
>>>> <spencer.gra...@effectivedefense.org> wrote:
>>>>> Hello, All:
>>>>>
>>>>> I've failed with multiple
>>>>> attempts to scrape the table of
>>>>> candidates from the website of
>>>>> the Missouri Secretary of
>>>>> State:
>>>>>
>>>>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>>> Hi Spencer,
>>>>
>>>> I tried the code below on an older
>>>> R-installation, and it works fine.
>>>> Not a full solution, but it's a
>>>> start:
>>>>
>>>>> library(RCurl)
>>>> Loading required package: bitops
>>>>> url <-
>>>>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>>>> M_sos <- getURL(url)
>>> Hi Bill et al.:
>>>
>>> That broke the dam: It gave me a
>>> character vector of length 1
>>> consisting of 218 KB. I fed that to
>>> XML::readHTMLTable and
>>> purrr::map_chr, both of which
>>> returned lists of 337 data.frames.
>>> The former retained names for all
>>> the tables, absent from the latter.
>>> The columns of the former are all
>>> character; that's not true for the
>>> latter.
>>>
>>> Sadly, it's not quite what I want:
>>> It's one table for each office-party
>>> combination, but it's lost the
>>> office designation. However, I'm
>>> confident I can figure out how to
>>> hack that.
>> Maybe try something like this:
>>
>> url <-
>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> h <- xml2::read_html(url)
>> tbl <- rvest::html_table(h)
> Dear Spencer,
>
> I unified the party tables after the
> first summary table like this:
>
> url <-
> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> M_sos <- RCurl::getURL(url)
> saveRDS(object=M_sos, file="dcp.rds")
> dat <- XML::readHTMLTable(M_sos)
> idx <- 2:length(dat)
> cn <- unique(unlist(lapply(dat[idx], colnames)))

      This is useful for this application.

> dat <- do.call(rbind,
>   sapply(idx, function(i, dat, cn) {
>     x <- dat[[i]]
>     x[,cn[!(cn %in% colnames(x))]] <- NA
>     x <- x[,cn]
>     x$Party <- names(dat)[i]
>     return(list(x))
>   }, dat=dat, cn=cn))
> dat[,"Date Filed"] <-
>   as.Date(x=dat[,"Date Filed"],
>     format="%m/%d/%Y")

      This misses something extremely important for this application: the political office. That's buried in the HTML or whatever it is. I'm using something like the following to find that:

str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])

      After I figure this out, I will use something like your code to combine it all into separate tables for each office, and then probably combine those into one table for the offices I'm interested in. For my present purposes, I don't want all the offices in Missouri, only the executive positions and those representing parts of the Kansas City metro area in the Missouri legislature.
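      A minimal sketch of how the gregexpr idea might be extended to tag each table with its office, using base R only. The toy string `html` stands in for M_sos, and the `<h3>` headings are purely an assumption for illustration; the real page's markup may differ, so the search patterns would need adjusting. The names `offices`, `office_pos`, `table_pos`, and `table_office` are hypothetical.

```r
# Toy stand-in for M_sos. The real page's markup is not known here, so
# the <h3> headings below are an assumption made for illustration only.
html <- paste0(
  "<table id='summary'><tr><td>S</td></tr></table>",
  "<h3>Governor</h3>",
  "<table id='t1'><tr><td>A</td></tr></table>",
  "<table id='t2'><tr><td>B</td></tr></table>",
  "<h3>Lieutenant Governor</h3>",
  "<table id='t3'><tr><td>C</td></tr></table>"
)

offices <- c("Governor", "Lieutenant Governor")

# Position of each office heading. Wrapping the name in ">" and "<"
# keeps "Governor" from also matching inside "Lieutenant Governor".
office_pos <- vapply(
  offices,
  function(o) as.integer(regexpr(paste0(">", o, "<"), html, fixed = TRUE)),
  integer(1)
)

# Start position of every <table ...> tag, in document order.
table_pos <- as.integer(gregexpr("<table", html, fixed = TRUE)[[1]])

# Sort the headings by position, then give each table the last heading
# that starts before it. Tables ahead of every heading (such as a
# leading summary table) get NA.
ord <- order(office_pos)
idx <- findInterval(table_pos, office_pos[ord])
idx[idx == 0] <- NA
table_office <- offices[ord][idx]
table_office
```

      If the tables come back from XML::readHTMLTable in document order, table_office could then be attached as an Office column in the same loop that adds Party. That ordering is an assumption worth checking against the real page.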
      Thanks again,
      Spencer Graves

> write.table(dat, file="dcp.tsv", sep="\t",
>   row.names=FALSE,
>   quote=TRUE, na="N/A")
>
> Best,
> Rasmus
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

[[alternative HTML version deleted]]