On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> Dear Rasmus et al.:

It is LILAND et al., is it not?  I do not belong to a large Confucian family structure (putting the hunter-gatherer horse-rider tribe name first in all-caps in the email), else it's customary to put a comma in there, isn't it? ... right, moving on:

> On 2020-07-25 04:10, Rasmus Liland wrote:
> > ?????

It might be a better idea to write the reply in plain-text UTF-8, or at least a Western or Eastern European ISO euro encoding, instead of us-ascii (maybe KOI8, ¯\_(ツ)_/¯) ... something in your email got string-replaced by "?????" and also "«" got replaced by "?".  Please research using Thunderbird, Claws Mail, or some other sane e-mail client; they are great, I promise.

> Please excuse:  Before my last post, I
> had written code to do all that.

Good!

> In brief, the political offices are
> "h3" tags.

Yes, some type of header element at least, in between the various tables, all of them children of the div in the element tree.

> I used "strsplit" to split the string
> at "<h3>".  I then wrote a
> function to find "</h3>", extract the
> political office and pass the rest to
> "XML::readHTMLTable", adding columns
> for party and political office.

Yes, doing that for the political office is also possible, but the party is inside the table's caption tag, which ends up as the name of the table in the XML::readHTMLTable list ...

> However, this suppressed "<br/>"
> everywhere.

Why is that?  Please explain.

> I thought there should be
> an option with something like
> "XML::readHTMLTable" that would not
> delete "<br/>" everywhere, but I
> couldn't find it.

No, there is not, AFAIK.  If anyone else knows, please say so *echoes in the forest*

> If you aren't aware of one, I can
> gsub("<br/>", "\n", ...) on the string
> for each political office before
> passing it to "XML::readHTMLTable".  I
> just tested this:  It works.

Such a great hack!  IMHO, this is much more flexible than using xml2::read_html, rvest::html_table, dplyr::mutate like here[1]

> I have other web scraping problems in
> my work plan for the few days.

Maybe, idk ...

> I will definitely try
> XML::htmlTreeParse, etc., as you
> suggest.
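
For the archive, here is a minimal sketch of the workaround as I understand it from the description above, not Spencer's actual code.  The `html` string, the office and party names, and the helper `parse_office` are all made up for illustration; the real page will differ.  The string handling is base R, and the XML step only runs if the XML package is installed.

```r
# Hypothetical stand-in for the scraped page: one <h3> per political
# office, each followed by a table whose <caption> holds the party.
html <- paste0(
  "<div>",
  "<h3>Governor</h3>",
  "<table><caption>Party A</caption>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>Jane Doe</td><td>1 Main St<br/>Springfield</td></tr>",
  "</table>",
  "<h3>Auditor</h3>",
  "<table><caption>Party B</caption>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>John Roe</td><td>2 Oak Ave<br/>Shelbyville</td></tr>",
  "</table>",
  "</div>"
)

# Split at "<h3>"; the first piece is whatever precedes the first office.
chunks <- strsplit(html, "<h3>", fixed = TRUE)[[1]][-1]

# Hypothetical helper: separate the office name from the rest of the
# chunk and turn "<br/>" into newlines before any HTML parsing happens.
parse_office <- function(chunk) {
  parts <- strsplit(chunk, "</h3>", fixed = TRUE)[[1]]
  list(
    office = parts[1],
    body   = gsub("<br/>", "\n", parts[2], fixed = TRUE)
  )
}

parsed <- lapply(chunks, parse_office)

# Optional: hand each cleaned chunk to XML::readHTMLTable and add the
# party (taken from the list name, i.e. the caption) and the office.
if (requireNamespace("XML", quietly = TRUE)) {
  tables <- lapply(parsed, function(p) {
    doc  <- XML::htmlParse(p$body, asText = TRUE)
    tabs <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)
    tb   <- tabs[[1]]
    tb$party  <- names(tabs)[1]
    tb$office <- p$office
    tb
  })
  print(tables)
}
```

Doing the gsub on the raw string keeps the line break as an ordinary character inside the cell text, so nothing downstream needs to know about "<br/>" at all.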

I wish you good luck,
Rasmus

[1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.