Hello All,

I need some help figuring out how to extract combinations of target words/terms 
from the free-text medical records of cancer patients. I've provided some sample 
data and code below to illustrate what I'm trying to do. At the moment, I'm 
trying to extract sentences that contain the word "breast" plus either 
"metastatic" or "stage IV". 

It's been some time since I used R and I feel a bit rusty. I wrote a function 
called "sentence_match" that seems to work well when applied to a single piece 
of text. You can see that by running the section titled "Working code". I 
thought it would be easy to apply my function to a data set (tibble or df), but 
that doesn't seem to be the case. My unsuccessful attempt to do this appears in 
the section titled "Non-working code". 

If someone could help me get my code up and running, that would be greatly 
appreciated. I'm using a lot of functions from Hadley Wickham's packages, but 
that's not particularly necessary. Although I have only a few entries in my 
sample data, my actual data are pretty large. Currently, I'm working with over 
a million records. Some records contain only a single sentence, but many have 
several paragraphs. One concern I had was that, even if I could get my code 
working, it would be too inefficient to handle that volume of data. 
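
One possible way to limit the work on a large data set, offered only as an 
untested sketch, is to screen whole records first and split only the candidates 
into sentences. The vector `notes` and the name `candidate` below are 
hypothetical and stand in for the real text column.

```r
library(stringr)

# Hypothetical example records; the real data would be the text column.
notes <- c(
  "This sentence contains the word metastatic. This sentence contains the term stage IV.",
  "This sentence contains no target words.",
  "This sentence contains the word metastatic and the word breast."
)

# A record can only contain a matching sentence if the record as a whole
# mentions "breast" and at least one of "metastatic" or "stage IV".
candidate <- str_detect(notes, fixed("breast")) &
  str_detect(notes, "metastatic|stage IV")

# Only these records would still need to be split into sentences.
notes[candidate]
```

The screen is a necessary condition, not a sufficient one: a record can pass it 
with the terms in different sentences, so the sentence-level check is still 
needed afterwards.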

Thanks,

Paul


library(tidyverse)
library(stringr)
library(lubridate)
 
sentence_match <- function(x){
  sentence_extract <- str_extract_all(sampletxt, boundary("sentence"),
                                      simplify = TRUE)
  sentence_number <- intersect(str_which(sentence_extract, "breast"),
                               str_which(sentence_extract, "metastatic|stage IV"))
  sentence_match <- str_c(sentence_number, ": ",
                          sentence_extract[sentence_number], collapse = "")
  sentence_match
}
 
#### Working code ####
 
sampletxt <- "This sentence contains the word metastatic and the word breast. This sentence contains no target words."

sentence_match(sampletxt)
 
#### Non-working code ####
 
sampletxt <-
  structure(
    list(
      PTNO = c(1, 2, 2, 2),
      DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
      TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
      TVAR = c(
        "This sentence contains the word metastatic. This sentence contains the term stage IV.",
        "This sentence contains no target words. This sentence also contains no target words.",
        "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
        "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
      )
    ),
    .Names = c("PTNO", "DATE", "TYPE", "TVAR"),
    class = c("tbl_df",
              "tbl", "data.frame"),
    row.names = c(NA,-4L)
  )
  
sampletxt2 <- group_by_at(sampletxt, vars(PTNO, DATE, TYPE))
sampletxt2 <- 
  sampletxt2 %>%
  mutate(
    EXTRACTED = sentence_match(TVAR)
  )
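
A possible way to run the extraction once per record is sketched below. It is 
an illustration under assumptions, not a tested fix for the full data. The name 
`sentence_match_one` and the small tibble `notes` are hypothetical. The variant 
reads its argument `x`, whereas sentence_match above reads the global object 
`sampletxt`, which may be part of the problem. It returns NA when no sentence 
matches, and map_chr() from purrr calls it once per row.

```r
library(stringr)
library(purrr)
library(dplyr)

# Hypothetical variant of sentence_match(): it works on its argument `x`
# (a single string) instead of the global object `sampletxt`.
sentence_match_one <- function(x) {
  # Split one record into sentences.
  sentences <- str_extract_all(x, boundary("sentence"))[[1]]
  # Positions of sentences that mention "breast" and either
  # "metastatic" or "stage IV".
  hits <- which(str_detect(sentences, "breast") &
                  str_detect(sentences, "metastatic|stage IV"))
  # No matching sentence in this record.
  if (length(hits) == 0) {
    return(NA_character_)
  }
  # Prefix each matching sentence with its position and join them.
  paste0(hits, ": ", str_trim(sentences[hits]), collapse = " ")
}

# Hypothetical two-row example in the shape of the sample data.
notes <- tibble(
  PTNO = c(1, 2),
  TVAR = c(
    "This sentence contains the word metastatic. This sentence contains the term stage IV.",
    "This sentence contains the word breast and the term stage IV. This sentence contains no target words."
  )
)

# map_chr() applies the function to each record and returns a character vector.
notes <- mutate(notes, EXTRACTED = map_chr(TVAR, sentence_match_one))
```

Because the function is applied to one record at a time, no grouping by PTNO, 
DATE and TYPE is needed for this step.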

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
