The model is trained to disambiguate punctuation characters, which in most 
cases means the period.
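
For illustration, a minimal sketch of how the OpenNLP detector is usually 
driven; the model file name and the sample text below are placeholders, not 
necessarily what cTAKES ships:

    import java.io.FileInputStream;
    import java.io.InputStream;

    import opennlp.tools.sentdetect.SentenceDetectorME;
    import opennlp.tools.sentdetect.SentenceModel;
    import opennlp.tools.util.Span;

    public class SentenceSplitSketch {
        public static void main(String[] args) throws Exception {
            // Placeholder path; point this at whatever sentence model you use.
            try (InputStream in = new FileInputStream("sd-med-model.zip")) {
                SentenceModel model = new SentenceModel(in);
                SentenceDetectorME detector = new SentenceDetectorME(model);

                String text = "Pt. seen on 5/21. ENT: negative.";

                // The model only makes a decision at candidate end-of-sentence
                // characters (by default the period and similar punctuation);
                // everything else is passed through untouched.
                for (Span sentence : detector.sentPosDetect(text)) {
                    System.out.println(sentence.getCoveredText(text));
                }
            }
        }
    }
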
--Guergana

-----Original Message-----
From: Steven Bethard [mailto:steven.beth...@colorado.edu] 
Sent: Tuesday, May 21, 2013 12:07 PM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On May 21, 2013, at 9:53 AM, "Savova, Guergana" 
<guergana.sav...@childrens.harvard.edu> wrote:
> The OpenNLP sentence segmenter is trained on clinical data (I cannot remember 
> exactly how many sentences were in the training corpus). This is the model 
> distributed with cTAKES. The only hard rule is the newline.

If it's trained on clinical data, why does it need a hard rule for that? Why 
isn't the model able to learn when to break on a newline and when not to?

Steve

> --Guergana
> 
> -----Original Message-----
> From: Steven Bethard [mailto:steven.beth...@colorado.edu]
> Sent: Tuesday, May 21, 2013 11:38 AM
> To: dev@ctakes.apache.org
> Subject: Re: sentence detector newline behavior
> 
> On May 21, 2013, at 9:02 AM, Tim Miller 
> <timothy.mil...@childrens.harvard.edu> wrote:
>> I think the whole reason to use a machine learning approach for 
>> sentence detection should be to help weigh evidence in exactly these 
>> cases where hard rules cause problems, mainly 1) when a period does 
>> not end a sentence, but also 2) when a newline does and does not mean 
>> the end of a sentence.
> 
> Perhaps we should consider re-training the OpenNLP sentence segmenter on some 
> clinical data? Presumably we can get sentences from the TreeBank annotations.
> 
> I don't know much about the OpenNLP sentence segmenter though. Does it only 
> classify on periods? We'd want to classify all periods and newlines. And we'd 
> want to add features that capture patterns like "XXX: YYY".
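> 
> A rough sketch of what retraining could look like; the training file, the 
> choice of end-of-sentence characters, and the exact train(...) call all 
> depend on the OpenNLP version and on our data, so treat it as a guess 
> rather than a recipe:
> 
>     import java.io.FileOutputStream;
>     import java.io.FileReader;
> 
>     import opennlp.tools.sentdetect.SentenceDetectorFactory;
>     import opennlp.tools.sentdetect.SentenceDetectorME;
>     import opennlp.tools.sentdetect.SentenceModel;
>     import opennlp.tools.sentdetect.SentenceSample;
>     import opennlp.tools.sentdetect.SentenceSampleStream;
>     import opennlp.tools.util.ObjectStream;
>     import opennlp.tools.util.PlainTextByLineStream;
>     import opennlp.tools.util.TrainingParameters;
> 
>     public class RetrainSentenceModelSketch {
>         public static void main(String[] args) throws Exception {
>             // One sentence per line; "clinical-sents.txt" is a placeholder
>             // for sentences pulled from, e.g., the TreeBank annotations.
>             ObjectStream<String> lines =
>                 new PlainTextByLineStream(new FileReader("clinical-sents.txt"));
>             ObjectStream<SentenceSample> samples = new SentenceSampleStream(lines);
> 
>             // Let the model classify newlines as well as the usual
>             // punctuation, instead of hard-coding a break on '\n'.
>             // Extra features (e.g. the "XXX: YYY" pattern) would probably
>             // mean subclassing SentenceDetectorFactory to supply a
>             // different context generator.
>             SentenceDetectorFactory factory = new SentenceDetectorFactory(
>                 "en", true, null, new char[] {'.', '?', '!', '\n'});
> 
>             SentenceModel model = SentenceDetectorME.train(
>                 "en", samples, factory, TrainingParameters.defaultParams());
>             try (FileOutputStream out = new FileOutputStream("sd-clinical.zip")) {
>                 model.serialize(out);
>             }
>         }
>     }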
> 
> Steve
> 
>> It is of course bad that, in your example, if you don't put a sentence 
>> break you will think that "extravascular findings" is negated. But it 
>> is also bad if you put a sentence break immediately after the word 
>> "and" at the end of a line and then find that your language model 
>> thinks that "and <eos>" is a good bigram.
>> 
>> I will create a JIRA issue for the parameter idea, and try to implement 
>> it and see if it gets OK results with the existing model.
>> Tim
>> 
>> On 05/21/2013 10:11 AM, Masanz, James J. wrote:
>>> +1 for adding a boolean parameter, or perhaps instead a list of 
>>> section IDs
>>> 
>>> The sentence detector model was trained on data that always breaks at 
>>> carriage returns.
>>> 
>>> It is important for text that is a list, something like this:
>>> 
>>> Heart Rate: normal
>>> ENT: negative
>>> EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
>>> 
>>> And without breaking on the line ending, the word "negative" would 
>>> negate "extravascular findings" (see the sketch below).
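>>> 
>>> One crude way to recognize those list lines (a sketch only; the pattern 
>>> and the class are made up for illustration, not something in cTAKES):
>>> 
>>>     import java.util.regex.Pattern;
>>> 
>>>     // A line that looks like "Some Header: value" probably starts a new
>>>     // list item, so the newline in front of it should end the sentence.
>>>     final class HeaderLineCheck {
>>>         private static final Pattern HEADER_LINE =
>>>             Pattern.compile("^[A-Za-z][A-Za-z /]{0,40}:\\s");
>>> 
>>>         static boolean newlineEndsSentence(String nextLine) {
>>>             return HEADER_LINE.matcher(nextLine).find();
>>>         }
>>>     }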
>>> 
>>> 
>>> -----Original Message-----
>>> From: dev-return-1605-Masanz.James=mayo....@ctakes.apache.org
>>> [mailto:dev-return-1605-Masanz.James=mayo....@ctakes.apache.org] On 
>>> Behalf Of Miller, Timothy
>>> Sent: Tuesday, May 21, 2013 7:07 AM
>>> To: dev@ctakes.apache.org
>>> Subject: sentence detector newline behavior
>>> 
>>> The sentence detector always ends a sentence where there are newlines.
>>> This is a problem for some notes (e.g. MIMIC radiology notes) where 
>>> a line can wrap in the middle of a sentence at specified character 
>>> offsets. From the comments in SentenceDetector, the code seems to be 
>>> split up very logically: it first runs the OpenNLP sentence 
>>> detector, then breaks any detected sentence wherever there is a newline. 
>>> Questions:
>>> 1) Would it be good to add a boolean parameter for breaking on newlines 
>>> (see the sketch below)?
>>> 2) If that section were removed/avoided, does the OpenNLP sentence 
>>> detector give good results given our model? Or is the model trained 
>>> on text that always breaks at carriage returns?
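>>> 
>>> A sketch of what (1) could look like inside the annotator; the parameter 
>>> name "BreakOnNewlines" is hypothetical, just to show the shape of it:
>>> 
>>>     import org.apache.uima.UimaContext;
>>>     import org.apache.uima.resource.ResourceInitializationException;
>>> 
>>>     // Field and initialize() added to the SentenceDetector annotator;
>>>     // the later newline-splitting pass would only run when this is true.
>>>     private boolean breakOnNewlines = true;
>>> 
>>>     @Override
>>>     public void initialize(UimaContext aContext)
>>>             throws ResourceInitializationException {
>>>         super.initialize(aContext);
>>>         Boolean value =
>>>             (Boolean) aContext.getConfigParameterValue("BreakOnNewlines");
>>>         if (value != null) {
>>>             breakOnNewlines = value.booleanValue();
>>>         }
>>>     }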
>>> 
>>> Tim
>> 
> 
