Have you looked at the CRAN Natural Language Processing Task View? If not,
why not? If so, why were the resources described there inadequate?

Bert

On Jul 11, 2017 10:49 AM, "Paul Miller via R-help" <r-help@r-project.org>
wrote:

> Hello All,
>
> I need some help figuring out how to extract combinations of target
> words/terms from cancer patient text medical records. I've provided some
> sample data and code below to illustrate what I'm trying to do. At the
> moment, I'm trying to extract sentences that contain the word "breast" plus
> either "metastatic" or "stage IV".
>
> It's been some time since I used R and I feel a bit rusty. I wrote a
> function called "sentence_match" that seemed to work well when applied to a
> single piece of text. You can see that by running the section titled
> "Working code". I thought it would be easy to apply my function to a data
> set (tibble or df), but that doesn't seem to be the case. My unsuccessful
> attempt to do this appears in the section titled "Non-working code".
>
> If someone could help me get my code up and running, that would be greatly
> appreciated. I'm using a lot of functions from Hadley Wickham's packages,
> but that's not particularly necessary. Although I have only a few entries
> in my sample data, my actual data are pretty large. Currently, I'm working
> with over a million records. Some records contain only a single sentence,
> but many have several paragraphs. One concern I had was that, even if I
> could get my code working, it would be too inefficient to handle that
> volume of data.
>
> Thanks,
>
> Paul
>
>
> library(tidyverse)
> library(stringr)
> library(lubridate)
>
> sentence_match <- function(x){
>   sentence_extract <- str_extract_all(sampletxt, boundary("sentence"),
>                                       simplify = TRUE)
>   sentence_number <- intersect(str_which(sentence_extract, "breast"),
>                                str_which(sentence_extract, "metastatic|stage IV"))
>   sentence_match <- str_c(sentence_number, ": ",
>                           sentence_extract[sentence_number],
>                           collapse = "")
>   sentence_match
> }
>
> #### Working code ####
>
> sampletxt <-
>   "This sentence contains the word metastatic and the word breast. This sentence contains no target words."
>
> sentence_match(sampletxt)
>
> #### Non-working code ####
>
> sampletxt <-
>   structure(
>     list(
>       PTNO = c(1, 2, 2, 2),
>       DATE = structure(c(16436, 16436, 16832, 16845), class = "Date"),
>       TYPE = c("Progress note", "CAT scan", "Progress note", "Progress note"),
>       TVAR = c(
>         "This sentence contains the word metastatic. This sentence contains the term stage IV.",
>         "This sentence contains no target words. This sentence also contains no target words.",
>         "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
>         "This sentence contains the words breast and the term metastatic. This sentence contains the word breast and the term stage IV."
>       )
>     ),
>     .Names = c("PTNO", "DATE", "TYPE", "TVAR"),
>     class = c("tbl_df",
>               "tbl", "data.frame"),
>     row.names = c(NA,-4L)
>   )
>
> sampletxt2 <- group_by_at(sampletxt, vars(PTNO, DATE, TYPE))
> sampletxt2 <-
>   sampletxt2 %>%
>   mutate(
>     EXTRACTED = sentence_match(TVAR)
>   )
>
> ______________________________________________
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
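One possible way to make the quoted function work on a whole column is
sketched below. Note that the quoted sentence_match() reads the global
object sampletxt rather than its argument x, which appears to be why it
stops working once sampletxt becomes a tibble. The name sentence_match_vec
is hypothetical, and returning NA for records with no matching sentence is
an assumption, not something the original post specifies.

```r
library(stringr)

# Hypothetical vectorised variant of sentence_match(): takes a character
# vector and returns exactly one string (or NA) per element.
sentence_match_vec <- function(x) {
  vapply(x, function(txt) {
    # Split one record into sentences and drop surrounding whitespace.
    sentences <- str_trim(str_split(txt, boundary("sentence"))[[1]])
    # A sentence qualifies if it has "breast" plus "metastatic" or "stage IV".
    hit <- str_detect(sentences, "breast") &
      str_detect(sentences, "metastatic|stage IV")
    # Assumption: records with no qualifying sentence yield NA.
    if (!any(hit, na.rm = TRUE)) return(NA_character_)
    # Number each qualifying sentence by its position within the record.
    str_c(which(hit), ": ", sentences[which(hit)], collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

notes <- c(
  "This sentence contains the word metastatic and the word breast. This sentence contains no target words.",
  "This sentence contains no target words."
)
sentence_match_vec(notes)
```

Because the function returns one value per input element, it can be used
directly as mutate(sampletxt, EXTRACTED = sentence_match_vec(TVAR)) without
any grouping.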

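On the concern about a million or more records: a cheap record-level
filter may cut the work considerably before any sentence splitting is
done. The sketch below uses base R only. The helper name has_targets is
hypothetical, and the filter is only a necessary condition, since the two
terms may sit in different sentences of the same record.

```r
# Hypothetical helper: TRUE for records that mention "breast" and also
# "metastatic" or "stage IV" somewhere in the text.
has_targets <- function(x) {
  grepl("breast", x, fixed = TRUE) & grepl("metastatic|stage IV", x)
}

notes <- c(
  "This sentence contains the word metastatic. This sentence contains the term stage IV.",
  "This sentence contains no target words.",
  "This sentence contains the word metastatic and the word breast."
)

# Indices of records worth passing on to the slower sentence-level check.
candidates <- which(has_targets(notes))
candidates
```

Only the records in candidates would then need to be split into
sentences; how much this saves depends on how rare the target terms are in
the real data.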