The model is trained to disambiguate punctuation characters, which in most cases means the period. --Guergana
-----Original Message-----
From: Steven Bethard [mailto:steven.beth...@colorado.edu]
Sent: Tuesday, May 21, 2013 12:07 PM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On May 21, 2013, at 9:53 AM, "Savova, Guergana" <guergana.sav...@childrens.harvard.edu> wrote:
> The OpenNLP sentence segmenter is trained on clinical data (cannot remember
> exactly how many sentences were in the training corpus). This is the model
> distributed with cTAKES. The only hard rule is the new line.

If it's trained on clinical data, why does it need a hard rule for that? Why isn't the model able to learn when to break on a newline or not?

Steve

> --Guergana
>
> -----Original Message-----
> From: Steven Bethard [mailto:steven.beth...@colorado.edu]
> Sent: Tuesday, May 21, 2013 11:38 AM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
>
> On May 21, 2013, at 9:02 AM, Tim Miller
> <timothy.mil...@childrens.harvard.edu> wrote:
>> I think the whole reason to use a machine learning approach for
>> sentence detection should be to help weigh evidence with these cases
>> where hard rules cause problems, mainly 1) when a period does not end
>> a sentence, but also 2) where a newline does and does not mean end of
>> sentence.
>
> Perhaps we should consider re-training the OpenNLP sentence segmenter on some
> clinical data? Presumably we can get sentences from the TreeBank annotations.
>
> I don't know much about the OpenNLP sentence segmenter though. Does it only
> classify on periods? We'd want to classify all periods and newlines. And we'd
> want to add features that capture patterns like "XXX: YYY".
>
> Steve
>
>> It
>> is of course bad that in your example if you don't put a sentence
>> break you will think that "extravascular findings" is negated. But it
>> is also bad if you put a sentence break immediately after the word
>> "and" at the end of a line and then you find that your language model
>> thinks that "and <eos>" is a good bigram.
>>
>> I will create a jira for the parameter thing, and try to implement it
>> and see if it gets ok results with the existing model.
>> Tim
>>
>> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
>>> +1 for adding a boolean parameter, or perhaps instead a list of
>>> section IDs
>>>
>>> The sentence detector model was trained on data that always breaks at
>>> carriage returns.
>>>
>>> It is important for text that is a list something like this:
>>>
>>> Heart Rate: normal
>>> ENT: negative
>>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
>>>
>>> And without breaking on the line ending, the word negative would
>>> negate extravascular findings
>>>
>>>
>>> -----Original Message-----
>>> From: dev-return-1605-Masanz.James=mayo....@ctakes.apache.org
>>> [mailto:dev-return-1605-Masanz.James=mayo....@ctakes.apache.org] On
>>> Behalf Of Miller, Timothy
>>> Sent: Tuesday, May 21, 2013 7:07 AM
>>> To: dev@ctakes.apache.org
>>> Subject: sentence detector newline behavior
>>>
>>> The sentence detector always ends a sentence where there are newlines.
>>> This is a problem for some notes (e.g. MIMIC radiology notes) where
>>> a line can wrap in the middle of a sentence at specified character
>>> offsets. In the comments for SentenceDetector, it seems to be split
>>> up very logically in that it first runs the opennlp sentence
>>> detector, then breaks any detected sentence wherever there is a newline.
>>> Questions:
>>> 1) Would it be good to add a boolean parameter for breaking on newlines?
>>> 2) If that section was removed/avoided, does the opennlp sentence
>>> detector give good results given our model? Or is the model trained
>>> on text that always breaks at carriage returns?
>>>
>>> Tim
>>
>
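
For anyone reading this thread in the archive, here is a rough sketch of the two-stage behavior discussed above: run a segmenter first, then optionally break each detected sentence at newlines. It is not the cTAKES SentenceDetector code. The names detect_sentences and break_on_newlines are hypothetical, and naive_period_segmenter is only a stand-in for the trained OpenNLP model (it simply ends a sentence at every period).

```python
import re
from typing import Callable, List


def naive_period_segmenter(text: str) -> List[str]:
    """Hypothetical stand-in for the trained model: end a sentence at every period."""
    return re.findall(r"[^.]+\.?", text)


def detect_sentences(
    text: str,
    segmenter: Callable[[str], List[str]] = naive_period_segmenter,
    break_on_newlines: bool = True,
) -> List[str]:
    """Stage 1: run the segmenter. Stage 2: optionally split its output at newlines."""
    sentences = []
    for chunk in segmenter(text):
        # With the flag on, every line inside a detected sentence becomes its
        # own sentence (the hard rule). With it off, the chunk is kept whole.
        pieces = chunk.splitlines() if break_on_newlines else [chunk]
        for piece in pieces:
            cleaned = " ".join(piece.split())  # collapse whitespace, including newlines
            if cleaned:
                sentences.append(cleaned)
    return sentences


list_note = "Heart Rate: normal\nENT: negative\nEXTRAVASCULAR FINDINGS: Severe prostatic enlargement."
wrapped_note = "No acute disease and\nno effusion. Heart size normal."

# The list-style note needs the newline rule to keep "negative" away from the next line.
print(detect_sentences(list_note, break_on_newlines=True))
# The hard-wrapped note is only segmented sensibly with the rule switched off.
print(detect_sentences(wrapped_note, break_on_newlines=False))
```

With this toy segmenter neither setting handles both notes, which is the trade-off raised in the thread: the flag fixes wrapped text but loses the list breaks, unless the model itself learns to classify newlines.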