[R] Complex text parsing task

Paul Miller Mon, 21 May 2012 08:33:41 -0700

Hello Everyone,

I have what I think is a complex text parsing task. I've provided some sample 
data below. There's a relatively simple version of the coding that needs to be 
done and a more complex version. If someone could help me out with either 
version, I'd greatly appreciate it.


Here are my sample data.

haveData <- 
structure(list(profile_key = structure(c(1L, 1L, 2L, 2L, 2L, 
3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L), .Label = c("001-001 ", 
"001-002 ", "001-003 ", "001-004 ", "001-005 ", "001-006 ", "001-007 "
), class = "factor"), encounter_date = structure(c(9L, 10L, 11L, 
12L, 13L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 4L, 7L, 7L), .Label = c(" 2009-03-01 ", 
" 2009-03-22 ", " 2009-04-01 ", " 2010-03-01 ", " 2010-10-15 ", 
" 2010-11-15 ", " 2011-03-01 ", " 2011-03-14 ", " 2011-10-10 ", 
" 2011-10-24 ", " 2012-09-15 ", " 2012-10-05 ", " 2012-10-17 "
), class = "factor"), raw = structure(c(9L, 12L, 16L, 13L, 10L, 
7L, 6L, 3L, 2L, 4L, 14L, 15L, 1L, 5L, 8L, 11L), .Label = c(
" ... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ... ", 
" ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ", 
" ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ", 
" ... KRAS (Wild). ...", 
" ... KRAS results are in. Patient has the mutation. ... ", 
" ... KRAS results still pending. Note that patient was negative for Lynch mutation. ...", 
" ... KRAS test results pending. Note that patient was negative for Lynch mutation. ...", 
" ... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...", 
" ... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...", 
" ... Ordered KRAS testing. Waiting for results. ...", 
" ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...", 
" ... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...", 
" ... Still need to order KRAS mutation testing. ... ", 
" ... Tumor is negative for KRAS mutation. ...", 
" ... Tumor is wild type. Patient is eligible to receive Eribtux. ...", 
" ... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."
), class = "factor")), .Names = c("profile_key", "encounter_date", 
"raw"), row.names = c(NA, -16L), class = "data.frame")

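One thing to watch in the sample data: the factor labels are padded with spaces (for example "001-001 " and " 2011-10-10 "), so exact comparisons against unpadded IDs such as "001-001" would fail. The following is only a sketch of one way to tidy such a data frame before parsing, assuming the three columns shown above. The helper name clean_notes and the two-row toy data frame are hypothetical and used purely for illustration.

```r
# Hypothetical two-row stand-in for haveData, with the same padded factor labels.
toy <- data.frame(
  profile_key = factor(c("001-001 ", "001-001 ")),
  encounter_date = factor(c(" 2011-10-10 ", " 2011-10-24 ")),
  raw = factor(c(" ... Ordered KRAS testing. Waiting for results. ...",
                 " ... KRAS (Wild). ..."))
)

# Convert factors to character, strip padding, parse dates, and collapse
# runs of whitespace inside the note text.
clean_notes <- function(df) {
  data.frame(
    profile_key = trimws(as.character(df$profile_key)),
    encounter_date = as.Date(trimws(as.character(df$encounter_date))),
    raw = gsub("\\s+", " ", trimws(as.character(df$raw))),
    stringsAsFactors = FALSE
  )
}

cleaned <- clean_notes(toy)
str(cleaned)
```

The same call on the full data would be clean_notes(haveData).
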
The following code displays the results of so-called "simple" coding.

#### Simple coding ####

KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005", 
"001-006",  "001-007")
KRAStested <- c(2,3,2,2,2,3,3)
KRASwild <- c(1,0,2,0,3,1,3)
KRASmutant <- c(4,2,2,3,1,2,2)
simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant) 
simpleData

Here, KRAStested is calculated by summing all references to "KRAS" for each 
patient. KRASwild is calculated by summing all references to "wild type", "wild", 
and "negative" that come within 20 words of the closest reference to KRAS. 
KRASmutant is calculated by summing all references to "mutant", "mutated", and 
"positive" that occur within 20 words of the closest reference to KRAS.

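As a starting point, the "simple" rules just described might be sketched in base R roughly as follows. This rests on several assumptions that are not stated in the post: notes are pasted together per patient before counting, a "word" is any run of letters, digits, or "/", and "wild type" is counted through the single token "wild" so that it is not counted twice. The names tokenize, count_near_kras, and notes are hypothetical, and the counts this produces may not match the hand-coded values above.

```r
# Hypothetical mini data set built from a few of the sample notes.
notes <- data.frame(
  profile_key = c("001-005", "001-005", "001-005", "001-004"),
  raw = c(" ... KRAS (Wild). ...",
          " ... Tumor is negative for KRAS mutation. ...",
          " ... Tumor is wild type. Patient is eligible to receive Eribtux. ...",
          " ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... "),
  stringsAsFactors = FALSE
)

# Split a note into lower-case word tokens; "/" is kept so that dates and
# terms such as ER/PR stay single tokens.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# Count tokens in `terms` that lie within `window` words of the nearest "kras".
count_near_kras <- function(x, terms, window = 20) {
  w <- tokenize(x)
  kras <- which(w == "kras")
  hits <- which(w %in% terms)
  if (length(kras) == 0 || length(hits) == 0) return(0L)
  nearest <- vapply(hits, function(h) min(abs(h - kras)), integer(1))
  sum(nearest <= window)
}

# Assumption: all notes for a patient are treated as one running text.
texts <- vapply(split(notes$raw, notes$profile_key), paste, character(1),
                collapse = " ")

simple <- data.frame(
  KRASpatient = names(texts),
  KRAStested = unname(vapply(texts, function(x) sum(tokenize(x) == "kras"),
                             integer(1))),
  KRASwild = unname(vapply(texts, count_near_kras, integer(1),
                           terms = c("wild", "negative"))),
  KRASmutant = unname(vapply(texts, count_near_kras, integer(1),
                             terms = c("mutant", "mutated", "positive"))),
  stringsAsFactors = FALSE
)
simple
```

Matching on whole tokens means misspellings in the notes (such as "positve") are not counted; whether they should be is a separate decision.
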
The second kind of coding is what I'm referring to as "complex coding".  The 
following code displays the results of this type of coding.

#### Complex coding ####

KRAStested <- c(2,1,0,2,2,2,3)
KRASwild <- c(1,0,0,0,3,0,3)
KRASmutant <- c(0,0,0,3,0,1,0)
complexData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant) 
complexData

The results of "complex coding" differ substantially from those obtained under 
"simple coding" and I think illustrate the potential problems with that 
approach. With "complex coding", the goal would be to identify and sum only 
true references to KRAS testing and true references to the result of that 
testing (either wild type/negative or mutant/positive).

True references to KRAS testing would be identified using a set of qualifiers 
that eliminate the false references. So, for example, one of the patients in my 
(made up) sample data has the phrase "Will conduct KRAS mutation testing prior 
to initiation of therapy with Erbitux" in their medical record. In this case, 
"Will" is a qualifier that indicates this is not a true reference to KRAS 
testing. For this exercise, other qualifiers related to KRAS testing would 
include "need", "order" (but not the past tense "ordered"), "wait", "waiting", 
"await", and "awaiting". To count as a qualifier, a term would need to occur 
within 12 words of the closest reference to KRAS.
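
One possible literal reading of this rule, sketched in base R: a "KRAS" token counts as a true reference to testing only if none of the listed qualifiers occurs within 12 words of it. The function names tokenize and count_true_kras are hypothetical, the tokenizer (runs of letters, digits, or "/") is an assumption, and a strict reading like this may disagree with some of the hand-coded counts.

```r
# Split a note into lower-case word tokens; "/" is kept so that a date such
# as 10/10/2010 counts as one word.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# Qualifiers listed in the post; whole-token matching means "order" does not
# match "ordered".
test_qualifiers <- c("will", "need", "order", "wait", "waiting",
                     "await", "awaiting")

# Number of "kras" tokens with no qualifier within `window` words.
count_true_kras <- function(x, qualifiers = test_qualifiers, window = 12) {
  w <- tokenize(x)
  kras <- which(w == "kras")
  qual <- which(w %in% qualifiers)
  if (length(kras) == 0) return(0L)
  if (length(qual) == 0) return(length(kras))
  qualified <- vapply(kras, function(k) any(abs(k - qual) <= window),
                      logical(1))
  sum(!qualified)
}

count_true_kras("Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux.")
count_true_kras("Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux.")
```

In the second note "will" sits 13 words away from "KRAS" under this tokenizer, so it falls just outside the window and the reference is kept.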

True references to the results of testing would also be identified using a set 
of qualifiers that eliminate false references. Here the list of qualifiers 
would include "if", "lynch", "kras mutation test", "kras mutation testing" and 
"for kras mutation". Qualifiers would need to come within 12 words of a true 
reference to KRAS testing.
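
Because some of these qualifiers are multi-word phrases, they need phrase matching rather than single-token matching. The sketch below only reports which qualifiers occur within 12 words of a "KRAS" token; how a hit should then affect the wild/mutant counts is left open, since the post does not spell that out. The names tokenize, find_phrase, and qualifiers_near_kras are hypothetical, and for simplicity every "KRAS" token is used rather than only true references.

```r
# Split a note into lower-case word tokens ("/" is kept inside tokens).
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# Start positions at which the space-separated `phrase` occurs in tokens `w`.
find_phrase <- function(w, phrase) {
  p <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  n <- length(w) - length(p) + 1
  if (n < 1) return(integer(0))
  Filter(function(i) all(w[i:(i + length(p) - 1)] == p), seq_len(n))
}

result_qualifiers <- c("if", "lynch", "kras mutation test",
                       "kras mutation testing", "for kras mutation")

# Qualifiers whose start lies within `window` words of any "kras" token.
qualifiers_near_kras <- function(x, qualifiers = result_qualifiers,
                                 window = 12) {
  w <- tokenize(x)
  kras <- which(w == "kras")
  if (length(kras) == 0) return(character(0))
  near <- vapply(qualifiers, function(q) {
    pos <- find_phrase(w, q)
    length(pos) > 0 && any(abs(outer(pos, kras, "-")) <= window)
  }, logical(1))
  qualifiers[near]
}

qualifiers_near_kras("KRAS results still pending. Note that patient was negative for Lynch mutation.")
qualifiers_near_kras("KRAS (Wild).")
```

Matching whole tokens keeps "kras mutation test" and "kras mutation testing" distinct, which may or may not be what is wanted.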

There's an additional wrinkle for identifying true references to the results of 
testing. One also needs to take into account the presence of what I'm calling 
"nullifiers". For purposes of this exercise, nullifiers include "Ua Protein", 
"ER/PR", and "HER2/neu". If "positive" or "negative" comes closer to one of these 
terms than to a true reference to KRAS, then it should not be used to 
identify the results of KRAS testing.
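
The nullifier rule is a distance comparison, which might be sketched as follows: each "positive" or "negative" token is dropped when its nearest nullifier is strictly closer than its nearest "KRAS" token. The names tokenize, phrase_positions, and count_kept are hypothetical, distances are measured in word tokens (an assumption), and every "KRAS" token is used rather than only true references.

```r
# Split a note into lower-case word tokens; "/" is kept so that "ER/PR" and
# "HER2/neu" stay single tokens.
tokenize <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z0-9/]+"))
  words[nzchar(words)]
}

# All token positions covered by occurrences of a space-separated phrase.
phrase_positions <- function(w, phrase) {
  p <- strsplit(phrase, " ", fixed = TRUE)[[1]]
  n <- length(w) - length(p) + 1
  if (n < 1) return(integer(0))
  starts <- Filter(function(i) all(w[i:(i + length(p) - 1)] == p), seq_len(n))
  as.integer(unlist(lapply(starts, function(s) s + seq_along(p) - 1L)))
}

nullifiers <- c("ua protein", "er/pr", "her2/neu")

# For each term, count the hits that are NOT closer to a nullifier than to KRAS.
count_kept <- function(x, terms = c("positive", "negative"),
                       nulls = nullifiers) {
  w <- tokenize(x)
  kras <- which(w == "kras")
  null_pos <- unlist(lapply(nulls, function(n) phrase_positions(w, n)))
  vapply(terms, function(term) {
    hits <- which(w == term)
    kept <- vapply(hits, function(h) {
      d_kras <- min(abs(h - kras), Inf)      # Inf when the note has no KRAS
      d_null <- min(abs(h - null_pos), Inf)  # Inf when no nullifier present
      !(d_null < d_kras)
    }, logical(1))
    sum(kept)
  }, integer(1))
}

count_kept("Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve.")
count_kept("Patient is KRAS negative. Started Erbitux on 03/01/2011.")
```

In the first note the only correctly spelled "positive" sits next to "ER/PR", so it is dropped; in the second, "negative" has no nullifier nearby and is kept.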

Help with either type of coding would be greatly appreciated.

Thanks,

Paul

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
