[R] rvest and the not css selector

James Toll Thu, 15 Oct 2015 14:30:56 -0700

Hi,

I'm trying to use rvest to scrape a page, and I am having difficulty excluding 
child superscript elements via a CSS selector.  For example, here I've read the 
HTML and selected the nodes of interest:



library(rvest)
p <- read_html(targetUrl)   # targetUrl holds the URL of the page
p %>% html_nodes("td.xyz")


The result looks something like this:

{xml_nodeset (20)}
 [1] <td class="xyz" width="50%">Foo<font size="-1"><sup>9</sup></font>:</td>
 [2] <td class="xyz" width="50%">Bar<font size="-1"><sup>3</sup></font>:</td>
[...]


I would like to extract just the words "Foo" and "Bar", without the 
superscripts, by passing the nodes along to html_text().  I thought something 
like this would work, but it returns only the superscripts:

p %>% 
  html_nodes("td.xyz") %>%
  html_nodes(":not(sup)") %>% 
  html_text()


Perhaps I'm using the :not() selector improperly.  Any suggestions on how to 
get this to work properly?  Thanks.
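
In case it helps, here is a minimal, self-contained sketch that should reproduce 
the behaviour without needing the real page.  The HTML string is made up to 
resemble the output above (the table wrapper and the names html, cells and res 
are my own placeholders, not anything from the actual site).

```r
library(rvest)

# Made-up HTML standing in for the real page; only the two <td> cells
# are modelled on the output shown above.
html <- '<table><tr>
  <td class="xyz" width="50%">Foo<font size="-1"><sup>9</sup></font>:</td>
  <td class="xyz" width="50%">Bar<font size="-1"><sup>3</sup></font>:</td>
</tr></table>'

p <- read_html(html)

# Select the cells, as in the original code
cells <- p %>% html_nodes("td.xyz")

# The attempt that does not do what I hoped: it selects descendants of
# each cell that are not <sup>, rather than the cell minus its <sup>
res <- cells %>%
  html_nodes(":not(sup)") %>%
  html_text()

print(html_text(cells))  # full cell text, superscripts included
print(res)               # result of the :not(sup) attempt
```

My guess is that the second html_nodes() call is matching the <font> wrappers 
(which are descendants that are not <sup>) and html_text() then returns their 
contents, but I may be misreading how the selector is applied.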


James

