I was thinking the same thing as Steve. That's a pretty regular onc physical exam, so why not just split sentences with regexes built off a small list of defined onc physical exam terms? The interesting case would be "breast", as this term may appear in the body of a sentence (rather than just as a section term), but you could use a regex sub-match where you conditionally match "breast" first and then one or more key physical findings, to correctly identify THAT "breast" token as the term, e.g. at the beginning of the sentence. I would recommend keying on red-flag physical findings, as they are more likely to always be in the body of the sentence, for example: Breast: no lumps or masses palpable.
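To make the term-list idea concrete, here is a rough Python sketch. The term list, the findings list, and the function name are all placeholders I made up for illustration; a real list would have to come from the notes themselves. It treats "Term:" as a boundary, and only accepts "Breast:" when one of the findings shows up before the next colon.

```python
import re

# Hypothetical section terms and red-flag findings, for illustration only.
SECTION_TERMS = ["Lymphnodes", "Lungs", "CV", "Abdomen"]
BREAST_FINDINGS = ["lump", "mass", "swelling", "erythema", "tender"]

_plain = "|".join(re.escape(t) for t in SECTION_TERMS)
_findings = "|".join(re.escape(f) for f in BREAST_FINDINGS)

# A boundary is a known term followed by a colon. "Breast:" only counts when
# a finding appears before the next colon, i.e. before the next header.
# An optional leading "PE:" stays attached to the first section.
BOUNDARY = re.compile(
    rf"\b(?:PE:\s*)?(?:(?:{_plain}):|Breast:(?=[^:]*\b(?:{_findings})))"
)


def split_sections(text):
    """Split an exam note into one segment per recognised section header."""
    starts = [m.start() for m in BOUNDARY.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any text before the first header
    ends = starts[1:] + [len(text)]
    segments = (text[s:e].strip() for s, e in zip(starts, ends))
    return [seg for seg in segments if seg]


note = (
    "PE: Lymphnodes: neck and axilla without adenopathy "
    "Lungs: normal and clear to auscultation "
    "Breast: negative findings right/left breast with mild swelling "
    "Abdomen: Abdomen soft, non-tender."
)
for segment in split_sections(note):
    print(segment)
```

The second, lowercase "breast" and the second "Abdomen" are left alone because neither is followed by a colon. The lookahead is what keeps a stray "Breast:" with no findings after it from being treated as a header.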
I have a few other ideas if that's barking up the right tree.

JG

--
Sent from Mailbox for iPhone

On Sat, Aug 2, 2014 at 8:58 AM, Steven Bethard <steven.beth...@gmail.com> wrote:
> On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy
> <timothy.mil...@childrens.harvard.edu> wrote:
>> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear
>> to auscultation CV: regular rate and rhythm without murmur or gallop , S1,
>> S2 normal, no murmur, click, rub or gal*, chest is clear without rales or
>> wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative
>> findings right/left breast with mild swelling, warmth, mild erythema,
>> slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>>
>> It would be preferable to me to put sentence breaks in between the sections,
>> so the first two sentences would be:
>>
>> 1) PE: Lymphonodes...
>> 2) Lungs: normal...
> [snip]
>> Another example that breaks our model in a different way (truncated):
>> 1. Baseline labwork including tumor markers 2. Start DD AC on Friday 8/1
>> with RN chemo teach 3. S U parent study
> [snip]
>> Here it would be preferable to get:
>> 1.
>> Baseline labwork...
>> 2.
>> Start DD...
>> 3.
>> S U parent study
> Seems like rather than specifying a set of "candidate characters", we
> want to specify a candidate boundary regular expression. Something
> like, \p{P}|\b\p{Lu}|\b\p{N}, should cover all of the above cases:
> sentence boundaries may appear at punctuation marks, at uppercase
> letters after word boundaries, and at numbers after a word boundaries.
> Steve
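
For anyone who wants to see which positions Steve's candidate boundary expression would flag, here is a small Python approximation. Python's built-in `re` module does not support `\p{...}` classes, so this sketch swaps in `unicodedata` categories instead of the regex itself. The function name is made up, and the word-boundary check is only an approximation of `\b`.

```python
import re
import unicodedata


def candidate_boundaries(text):
    # Approximates the pattern \p{P}|\b\p{Lu}|\b\p{N} from the quoted message.
    offsets = []
    for i, ch in enumerate(text):
        category = unicodedata.category(ch)
        # A word boundary before a word character means the previous
        # character is missing or is not a word character.
        at_word_start = i == 0 or re.match(r"\w", text[i - 1]) is None
        if category.startswith("P"):
            offsets.append(i)  # any punctuation mark
        elif at_word_start and (category == "Lu" or category.startswith("N")):
            offsets.append(i)  # uppercase letter or number starting a word
    return offsets


print(candidate_boundaries("1. Start DD"))  # [0, 1, 3, 9]
```

These are only candidate positions. The model would still have to decide which of them are real sentence breaks, which is why every capitalised word shows up in the list.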