Hi Nick,

Can you elaborate (hopefully in a constructive way) on what it is that you find objectionable about my post?

Thanks,

Paul

--- On Mon, 5/21/12, Nick Gayeski <n...@wildfishconservancy.org> wrote:

> From: Nick Gayeski <n...@wildfishconservancy.org>
> Subject: RE: [R] Complex text parsing task
> To: "'Paul Miller'" <pjmiller...@yahoo.com>, r-help@r-project.org
> Received: Monday, May 21, 2012, 10:36 AM
>
> Please stop sending these emails!
>
> -----Original Message-----
> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Paul Miller
> Sent: Monday, May 21, 2012 8:32 AM
> To: r-help@r-project.org
> Subject: [R] Complex text parsing task
>
> Hello Everyone,
>
> I have what I think is a complex text parsing task. I've provided some sample data below. There's a relatively simple version of the coding that needs to be done and a more complex version. If someone could help me out with either version, I'd greatly appreciate it.
>
> Here are my sample data.
>
> haveData <- structure(list(
>   profile_key = structure(c(1L, 1L, 2L, 2L, 2L, 3L, 3L, 4L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 7L),
>     .Label = c("001-001 ", "001-002 ", "001-003 ", "001-004 ", "001-005 ", "001-006 ", "001-007 "),
>     class = "factor"),
>   encounter_date = structure(c(9L, 10L, 11L, 12L, 13L, 5L, 6L, 7L, 8L, 1L, 2L, 3L, 4L, 4L, 7L, 7L),
>     .Label = c(" 2009-03-01 ", " 2009-03-22 ", " 2009-04-01 ", " 2010-03-01 ", " 2010-10-15 ",
>       " 2010-11-15 ", " 2011-03-01 ", " 2011-03-14 ", " 2011-10-10 ", " 2011-10-24 ",
>       " 2012-09-15 ", " 2012-10-05 ", " 2012-10-17 "),
>     class = "factor"),
>   raw = structure(c(9L, 12L, 16L, 13L, 10L, 7L, 6L, 3L, 2L, 4L, 14L, 15L, 1L, 5L, 8L, 11L),
>     .Label = c(
>       " ... If patient KRAS result is wild type, they will start Erbitux. ... (Several lines of material) ... Ordered KRAS mutation test 11/11/2011. Results are still not available. ... ",
>       " ... KRAS (mutated). Therefore did not prescribe Erbitux. ... ",
>       " ... KRAS (mutated). Will not prescribe Erbitux due to mutation. ... ",
>       " ... KRAS (Wild). ...",
>       " ... KRAS results are in. Patient has the mutation. ... ",
>       " ... KRAS results still pending. Note that patient was negative for Lynch mutation. ...",
>       " ... KRAS test results pending. Note that patient was negative for Lynch mutation. ...",
>       " ... Ordered KRAS mutation testing on 02/15/2011. Results came back negative. ... (Several lines of material) ... Patient KRAS mutation test is negative. Will start Erbitux. ...",
>       " ... Ordered KRAS testing on 10/10/2010. Results not yet available. If patient has a mutaton, will start Erbitux. ...",
>       " ... Ordered KRAS testing. Waiting for results. ...",
>       " ... Patient is KRAS negative. Started Erbitux on 03/01/2011. ...",
>       " ... Received KRAS results on 10/20/2010. Test results indicate tumor is wild type. Ua Protein positve. ER/PR positive. HER2/neu positve. ...",
>       " ... Still need to order KRAS mutation testing. ... ",
>       " ... Tumor is negative for KRAS mutation. ...",
>       " ... Tumor is wild type. Patient is eligible to receive Eribtux. ...",
>       " ... Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux. ..."),
>     class = "factor")),
>   .Names = c("profile_key", "encounter_date", "raw"),
>   row.names = c(NA, -16L), class = "data.frame")
>
> The following code displays the results of so-called "simple" coding.
>
> #### Simple coding ####
>
> KRASpatient <- c("001-001", "001-002", "001-003", "001-004", "001-005", "001-006", "001-007")
> KRAStested <- c(2,3,2,2,2,3,3)
> KRASwild <- c(1,0,2,0,3,1,3)
> KRASmutant <- c(4,2,2,3,1,2,2)
> simpleData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
> simpleData
>
> Here, KRAStested is calculated by summing all references to "KRAS" for each patient. Wild is calculated by summing all references to "wild type", "wild", and "negative" that come within 20 words of the closest reference to KRAS. Mutant is calculated by summing all references to "mutant", "mutated", and "positive" that occur within 20 words of the closest reference to KRAS.
>
> The second kind of coding is what I'm referring to as "complex coding". The following code displays the results of this type of coding.
>
> #### Complex coding ####
>
> KRAStested <- c(2,1,0,2,2,2,3)
> KRASwild <- c(1,0,0,0,3,0,3)
> KRASmutant <- c(0,0,0,3,0,1,0)
> complexData <- data.frame(KRASpatient, KRAStested, KRASwild, KRASmutant)
> complexData
>
> The results of "complex coding" differ substantially from those obtained under "simple coding" and I think illustrate the potential problems with that approach. With "complex coding", the goal would be to identify and sum only true references to KRAS testing and true references to the result of that testing (either wild type/negative or mutant/positive).
>
> True references to KRAS testing would be identified using a set of qualifiers that eliminate the false references. So, for example, one of the patients in my (made up) sample data has the phrase "Will conduct KRAS mutation testing prior to initiation of therapy with Erbitux" in their medical record. In this case, "Will" is a qualifier that indicates this is not a true reference to KRAS testing. For this exercise, other qualifiers related to KRAS testing would include "need", "order" (but not the past tense "ordered"), "wait", "waiting", "await", and "awaiting". To be a qualifier, these terms would need to occur within 12 words of the closest true reference to KRAS.
>
> True references to the results of testing would also be identified using a set of qualifiers that eliminate false references. Here the list of qualifiers would include "if", "lynch", "kras mutation test", "kras mutation testing" and "for kras mutation". Qualifiers would need to come within 12 words of a true reference to KRAS testing.
>
> There's an additional wrinkle for identifying true references to the results of testing. One also needs to take into account the presence of what I'm calling "nullifiers". For purposes of this exercise, nullifiers include "Ua Protein", "ER/PR", and "HER2/neu". If "positive" or "negative" come closer to one of these words than to a true reference to KRAS, then they should not be used to identify the results of KRAS testing.
>
> Help with either type of coding would be greatly appreciated.
>
> Thanks,
>
> Paul
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.