I presume the combination turned out to perform best in the past...? (based on James's and Guergana's enum/medication examples) Having a flag to turn off the hard newline rule seems reasonable if it works (short of having to preprocess the MIMIC-formatted radiology notes, or retrain?). My 1/2 cent...

--Pei
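P.S. Roughly the kind of flag I have in mind, as a sketch only: the class, the breakOnNewline name, and the wiring below are made up for illustration and are not the actual SentenceDetector code.

    import java.util.ArrayList;
    import java.util.List;

    import opennlp.tools.sentdetect.SentenceDetectorME;

    // Sketch of the two-stage behavior Tim describes below: run the
    // OpenNLP maxent detector first, then (optionally) post-split each
    // detected sentence at newlines.
    public class NewlineSplitSketch {

        private final SentenceDetectorME detector;

        /** If false, trust the statistical model and leave newlines alone. */
        private final boolean breakOnNewline;

        public NewlineSplitSketch(SentenceDetectorME detector,
                                  boolean breakOnNewline) {
            this.detector = detector;
            this.breakOnNewline = breakOnNewline;
        }

        public List<String> detect(String text) {
            List<String> sentences = new ArrayList<String>();
            // stage 1: the statistical OpenNLP sentence detector
            for (String sentence : detector.sentDetect(text)) {
                if (!breakOnNewline) {
                    sentences.add(sentence);
                    continue;
                }
                // stage 2 (the current hard rule): break at every newline
                for (String piece : sentence.split("\\r?\\n")) {
                    if (piece.trim().length() > 0) {
                        sentences.add(piece.trim());
                    }
                }
            }
            return sentences;
        }
    }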
> -----Original Message-----
> From: Steven Bethard [mailto:steven.beth...@colorado.edu]
> Sent: Tuesday, May 21, 2013 12:07 PM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
>
> On May 21, 2013, at 9:53 AM, "Savova, Guergana"
> <guergana.sav...@childrens.harvard.edu> wrote:
> > The OpenNLP sentence segmenter is trained on clinical data (I cannot
> > remember exactly how many sentences were in the training corpus). This
> > is the model distributed with cTAKES. The only hard rule is the new line.
>
> If it's trained on clinical data, why does it need a hard rule for that?
> Why isn't the model able to learn when to break on a newline or not?
>
> Steve
>
> > --Guergana
> >
> > -----Original Message-----
> > From: Steven Bethard [mailto:steven.beth...@colorado.edu]
> > Sent: Tuesday, May 21, 2013 11:38 AM
> > To: dev@ctakes.apache.org
> > Subject: Re: sentence detector newline behavior
> >
> > On May 21, 2013, at 9:02 AM, Tim Miller
> > <timothy.mil...@childrens.harvard.edu> wrote:
> >> I think the whole reason to use a machine learning approach for
> >> sentence detection should be to help weigh evidence in the cases
> >> where hard rules cause problems, mainly 1) when a period does not
> >> end a sentence, but also 2) when a newline does and does not mean
> >> end of sentence.
> >
> > Perhaps we should consider re-training the OpenNLP sentence segmenter
> > on some clinical data? Presumably we can get sentences from the
> > TreeBank annotations.
> >
> > I don't know much about the OpenNLP sentence segmenter, though. Does
> > it only classify on periods? We'd want to classify all periods and
> > newlines. And we'd want to add features that capture patterns like
> > "XXX: YYY".
> >
> > Steve
> >
> >> It is of course bad that, in your example, if you don't put a
> >> sentence break you will think that "extravascular findings" is
> >> negated. But it is also bad if you put a sentence break immediately
> >> after the word "and" at the end of a line and then find that your
> >> language model thinks that "and <eos>" is a good bigram.
> >>
> >> I will create a JIRA for the parameter idea, try to implement it,
> >> and see whether it gets OK results with the existing model.
> >> Tim
> >>
> >> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
> >>> +1 for adding a boolean parameter, or perhaps instead a list of
> >>> section IDs.
> >>>
> >>> The sentence detector model was trained on data that always breaks
> >>> at carriage returns.
> >>>
> >>> It is important for text that is a list, something like this:
> >>>
> >>> Heart Rate: normal
> >>> ENT: negative
> >>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
> >>>
> >>> And without breaking on the line endings, the word "negative" would
> >>> negate "extravascular findings".
> >>>
> >>> -----Original Message-----
> >>> From: dev-return-1605-Masanz.James=mayo....@ctakes.apache.org
> >>> [mailto:dev-return-1605-Masanz.James=mayo....@ctakes.apache.org]
> >>> On Behalf Of Miller, Timothy
> >>> Sent: Tuesday, May 21, 2013 7:07 AM
> >>> To: dev@ctakes.apache.org
> >>> Subject: sentence detector newline behavior
> >>>
> >>> The sentence detector always ends a sentence where there is a
> >>> newline. This is a problem for some notes (e.g. MIMIC radiology
> >>> notes) where a line can wrap in the middle of a sentence at fixed
> >>> character offsets.
> >>> Judging from the comments in SentenceDetector, the logic is split up
> >>> very cleanly: it first runs the OpenNLP sentence detector, then
> >>> breaks any detected sentence wherever there is a newline. Questions:
> >>> 1) Would it be good to add a boolean parameter for breaking on
> >>> newlines?
> >>> 2) If that step were removed/avoided, does the OpenNLP sentence
> >>> detector give good results with our model? Or is the model trained
> >>> on text that always breaks at carriage returns?
> >>>
> >>> Tim
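And for the "XXX: YYY" patterns Steve mentions above, one shape such a feature could take (the regex, class, and method names are invented for this sketch, not taken from the cTAKES or OpenNLP code):

    import java.util.regex.Pattern;

    // Illustrative "section header" feature for a candidate break point:
    // does the text right after the break look like "XXX: YYY"?
    public class HeaderPatternFeature {

        // A short run of capitalized words followed by a colon, e.g.
        // "ENT:" or "EXTRAVASCULAR FINDINGS:" in James's list example.
        private static final Pattern HEADER =
                Pattern.compile("\\s*[A-Z][A-Za-z ]{0,40}:");

        /** True if the text starting at breakOffset opens with a header. */
        public static boolean startsWithHeader(String text, int breakOffset) {
            return HEADER.matcher(text)
                         .region(breakOffset, text.length())
                         .lookingAt();
        }
    }

If that returns true for the offset just after a newline, the detector's context generator could emit it as a feature string (say, "nextLineIsHeader") alongside the usual token features, and the maxent model could learn how much weight to give it.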