Very pleased to see so many people offer suggestions! Comparing some of these different methods might make an interesting student project.
Sean:
> Just an fyi. Does that make sense? Haven't had my coffee ...

Makes perfect sense. The downside is that it requires some kind of
higher-level understanding during sentence segmentation of what the
hierarchy is. You could imagine something that looks similar but with a
different logical structure. Long term, some big joint model that does all
of these things simultaneously is definitely something I'm interested in.

Steve:
> Seems like rather than specifying a set of "candidate characters", we
> want to specify a candidate boundary regular expression.

This might be possible with minimal changes to the model.

John:
> why not just split sentences with regexes off a small list of defined onc
> physical exam terms?

My preference for vanilla cTAKES is always to do basic linguistic things
like tokenization and sentence segmentation without reference to
context-specific rules, because such rules make those components less
portable. Obviously, for specific use cases or applications (like what Britt
is probably doing), you will use whatever information makes sense for your
domain. But I think we could get maybe 75% of the remaining cases (which are
probably only 5% of the total number of cases) by using smarter boundary
conditions like Steve suggested.

Thanks again,
Tim

On 08/02/2014 01:26 PM, John Green wrote:
> I was thinking the same thing as Steve. That's a pretty regular onc physical
> exam, why not just split sentences with regexes off a small list of defined
> onc physical exam terms? The interesting case would be breast, as this term
> may appear in the body of a sentence (rather than just as a term), but you
> could use a regex sub-match where you conditionally match breast first, then
> one or more key physical findings, to correctly identify THAT breast word
> token as the term, e.g. beginning of the sentence. I would recommend red-flag
> physical findings as they are more likely to always be in the body of the
> sentence, for example, Breast: no lumps or masses palpable.
>
> I have a few other ideas if that's barking up the right tree.
>
> JG
> --
> Sent from Mailbox for iPhone
>
> On Sat, Aug 2, 2014 at 8:58 AM, Steven Bethard <steven.beth...@gmail.com>
> wrote:
>
>> On Sat, Aug 2, 2014 at 7:43 AM, Miller, Timothy
>> <timothy.mil...@childrens.harvard.edu> wrote:
>>> PE: Lymphnodes: neck and axilla without adenopathy Lungs: normal and clear
>>> to auscultation CV: regular rate and rhythm without murmur or gallop , S1,
>>> S2 normal, no murmur, click, rub or gal*, chest is clear without rales or
>>> wheezing, no pedal edema, no JVD, no hepatosplenomegaly Breast: negative
>>> findings right/left breast with mild swelling, warmth, mild erythema,
>>> slightly tender, no seroma or hematoma Abdomen: Abdomen soft, non-tender.
>>>
>>> It would be preferable to me to put sentence breaks in between the
>>> sections, so the first two sentences would be:
>>>
>>> 1) PE: Lymphnodes...
>>> 2) Lungs: normal...
>> [snip]
>>> Another example that breaks our model in a different way (truncated):
>>> 1. Baseline labwork including tumor markers 2. Start DD AC on Friday 8/1
>>> with RN chemo teach 3. S U parent study
>> [snip]
>>> Here it would be preferable to get:
>>> 1.
>>> Baseline labwork...
>>> 2.
>>> Start DD...
>>> 3.
>>> S U parent study
>> Seems like rather than specifying a set of "candidate characters", we
>> want to specify a candidate boundary regular expression.
>> Something like \p{P}|\b\p{Lu}|\b\p{N} should cover all of the above
>> cases: sentence boundaries may appear at punctuation marks, at uppercase
>> letters after word boundaries, and at numbers after word boundaries.
>>
>> Steve
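
P.S. For anyone who wants to play with Steve's idea, here is a rough,
untested sketch in plain java.util.regex (the class and method names below
are made up for illustration, not actual cTAKES classes). It just enumerates
the offsets where the candidate-boundary expression matches, so a downstream
classifier could accept or reject each candidate instead of considering every
character:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch only (not a cTAKES class): list every offset where Steve's
// candidate-boundary expression matches.
public class CandidateBoundaryDemo {

    // Punctuation, or an uppercase letter at a word boundary,
    // or a digit at a word boundary. Backslashes are doubled because
    // this is a Java string literal.
    private static final Pattern CANDIDATE =
            Pattern.compile("\\p{P}|\\b\\p{Lu}|\\b\\p{N}");

    public static List<Integer> candidateOffsets(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = CANDIDATE.matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    public static void main(String[] args) {
        String note = "Lungs: normal and clear to auscultation CV: regular "
                + "rate and rhythm 1. Baseline labwork 2. Start DD AC on Friday 8/1";
        for (int offset : candidateOffsets(note)) {
            int end = Math.min(note.length(), offset + 15);
            System.out.println(offset + "\t" + note.substring(offset, end));
        }
    }
}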
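
And a similarly rough illustration of John's section-term splitting, again
with made-up names, only an example handful of terms rather than a real onc
lexicon, and a simple "term followed by a colon" condition standing in for
John's red-flag-findings check. This is the kind of domain-specific rule I'd
keep out of vanilla cTAKES, but it could work well in a targeted pipeline:

import java.util.Arrays;
import java.util.regex.Pattern;

// Sketch only (not cTAKES code): split a physical-exam blob in front of a
// small list of known section terms.
public class SectionTermSplitDemo {

    // Zero-width split point just before a known header like "Lungs:".
    // Requiring the colon means "breast" inside a sentence body is not
    // treated as a header.
    private static final Pattern SECTION_TERM = Pattern.compile(
            "(?=\\b(?:Lymphnodes|Lungs|CV|Breast|Abdomen):)");

    public static void main(String[] args) {
        String note = "Lymphnodes: neck and axilla without adenopathy "
                + "Lungs: normal and clear to auscultation "
                + "Breast: no lumps or masses palpable "
                + "Abdomen: Abdomen soft, non-tender.";
        Arrays.stream(SECTION_TERM.split(note))
                .map(String::trim)
                .filter(s -> !s.isEmpty())
                .forEach(System.out::println);
    }
}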