Have you looked at the CRAN Natural Language Processing Task View? If not, why not? If so, why were the resources described there inadequate?
Bert

On Jul 11, 2017 10:49 AM, "Paul Miller via R-help" <r-help@r-project.org> wrote:

> Hello All,
>
> I need some help figuring out how to extract combinations of target
> words/terms from cancer patient text medical records. I've provided some
> sample data and code below to illustrate what I'm trying to do. At the
> moment, I'm trying to extract sentences that contain the word "breast" plus
> either "metastatic" or "stage IV".
>
> It's been some time since I used R and I feel a bit rusty. I wrote a
> function called "sentence_match" that seemed to work well when applied to a
> single piece of text. You can see that by running the section titled
> "Working code". I thought that it might be possible easily to apply my
> function to a data set (tibble or df) but that doesn't seem to be the case.
> My unsuccessful attempt to do this appears in the section titled
> "Non-working code".
>
> If someone could help me get my code up and running, that would be greatly
> appreciated. I'm using a lot of functions from Hadley Wickham's packages,
> but that's not particularly necessary. Although I have only a few entries
> in my sample data, my actual data are pretty large. Currently, I'm working
> with over a million records. Some records contain only a single sentence,
> but many have several paragraphs. One concern I had was that, even if I
> could get my code working, it would be too inefficient to handle that
> volume of data.
>
> Thanks,
>
> Paul
>
>
> library(tidyverse)
> library(stringr)
> library(lubridate)
>
> sentence_match <- function(x){
>   sentence_extract <- str_extract_all(sampletxt, boundary("sentence"),
>                                       simplify = TRUE)
>   sentence_number <- intersect(str_which(sentence_extract, "breast"),
>                                str_which(sentence_extract, "metastatic|stage IV"))
>   sentence_match <- str_c(sentence_number, ": ",
>                           sentence_extract[sentence_number],
>                           collapse = "")
>   sentence_match
> }
>
> #### Working code ####
>
> sampletxt <- "This sentence contains the word metastatic and the word breast. This sentence contains no target words."
>
> sentence_match(sampletxt)
>
> #### Non-working code ####
>
> sampletxt <-
>   structure(
>     list(
>       PTNO = c(1, 2, 2, 2),
>       DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
>       TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
>       TVAR = c(
>         "This sentence contains the word metastatic. This sentence contains the term stage IV.",
>         "This sentence contains no target words. This sentence also contains no target words.",
>         "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
>         "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
>       )
>     ),
>     .Names = c("PTNO", "DATE", "TYPE", "TVAR"),
>     class = c("tbl_df", "tbl", "data.frame"),
>     row.names = c(NA,-4L)
>   )
>
> sampletxt2 <- group_by_at(sampletxt, vars(PTNO, DATE, TYPE))
> sampletxt2 <-
>   sampletxt2 %>%
>   mutate(
>     EXTRACTED = sentence_match(TVAR)
>   )
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
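One possible reading of why the quoted attempt fails: sentence_match() takes an argument x but reads the global sampletxt instead, and it expects a single string, whereas mutate() hands it the whole TVAR column. Below is a minimal sketch of one way around both points, not a tested solution for the real data. It assumes that returning NA for records with no matching sentence is acceptable, and the data frame name notes is made up for illustration; only PTNO and TVAR are kept from the sample data.

```r
library(dplyr)
library(stringr)

# Return the numbered sentences of ONE text that contain "breast" together
# with "metastatic" or "stage IV"; NA when no sentence qualifies.
sentence_match <- function(x) {
  # Use the argument x, not a global object.
  sentences <- str_extract_all(x, boundary("sentence"))[[1]]
  hits <- which(str_detect(sentences, "breast") &
                  str_detect(sentences, "metastatic|stage IV"))
  if (length(hits) == 0) return(NA_character_)
  str_c(hits, ": ", str_trim(sentences[hits]), collapse = " ")
}

# Reduced copy of the sample data ('notes' is a hypothetical name).
notes <- data.frame(
  PTNO = c(1, 2, 2, 2),
  TVAR = c(
    "This sentence contains the word metastatic. This sentence contains the term stage IV.",
    "This sentence contains no target words. This sentence also contains no target words.",
    "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
    "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
  ),
  stringsAsFactors = FALSE
)

# Apply the function to each record separately; vapply() guarantees
# exactly one character value per row.
notes <- notes %>%
  mutate(EXTRACTED = vapply(TVAR, sentence_match, character(1),
                            USE.NAMES = FALSE))
```

No grouping is needed here because the function is applied row by row. For a million records it may help to first keep only rows where str_detect() finds "breast" and one of the other terms anywhere in TVAR, since only those rows can contain a matching sentence; whether that is fast enough at that volume is untested.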