I think the whole reason to use a machine learning approach for sentence detection should be to help weigh evidence in the cases where hard rules cause problems, mainly 1) when a period does not end a sentence, but also 2) when a newline does or does not mean end of sentence. It is of course bad that in your example, if you don't put a sentence break, "extravascular findings" will be treated as negated. But it is also bad if you put a sentence break immediately after the word "and" at the end of a line, and your language model then learns that "and <eos>" is a good bigram.

I will create a JIRA for the parameter idea and try to implement it, to see whether it gets OK results with the existing model.
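
As a rough sketch of what that could look like (the class and parameter names below are hypothetical, not existing cTAKES identifiers, and this is not the actual SentenceDetector code):

// Sketch only -- "NewlineAwareSentenceDetector" and "BreakOnNewlines"
// are made-up names for illustration.
import org.apache.uima.UimaContext;
import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
import org.apache.uima.jcas.JCas;
import org.apache.uima.resource.ResourceInitializationException;

public class NewlineAwareSentenceDetector extends JCasAnnotator_ImplBase {

  public static final String PARAM_BREAK_ON_NEWLINES = "BreakOnNewlines";

  // Default to true so existing pipelines keep the current behavior.
  private boolean breakOnNewlines = true;

  @Override
  public void initialize(UimaContext context) throws ResourceInitializationException {
    super.initialize(context);
    Boolean value = (Boolean) context.getConfigParameterValue(PARAM_BREAK_ON_NEWLINES);
    if (value != null) {
      breakOnNewlines = value.booleanValue();
    }
  }

  @Override
  public void process(JCas jcas) {
    // 1) run the OpenNLP sentence detector over the document text (omitted here)
    // 2) only if breakOnNewlines is true, further split each detected
    //    sentence at newline characters, as the current code always does
  }
}

Defaulting the parameter to true would keep current pipelines unaffected.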
Tim

On 05/21/2013 10:11 AM, Masanz, James J. wrote:
+1 for adding a boolean parameter, or perhaps instead a list of section IDs

The sentence detector model was trained on data that always breaks at carriage 
returns.

It is important for text that is formatted as a list, something like this:

Heart Rate: normal
ENT: negative
EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.

And without breaking on the line endings, the word "negative" would negate 
"extravascular findings".
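
For illustration only (this is just a toy Java example, not how the pipeline is implemented), splitting each line into its own sentence keeps "negative" scoped to the ENT line:

// Toy example only; the real pipeline uses the SentenceDetector annotator.
public class LineSplitDemo {
  public static void main(String[] args) {
    String note = "Heart Rate: normal\n"
                + "ENT: negative\n"
                + "EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.";
    // Treat each line as a separate sentence.
    for (String line : note.split("\\r?\\n")) {
      System.out.println("SENTENCE: " + line);
    }
  }
}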


-----Original Message-----
From: dev-return-1605-Masanz.James=mayo....@ctakes.apache.org 
[mailto:dev-return-1605-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of 
Miller, Timothy
Sent: Tuesday, May 21, 2013 7:07 AM
To: dev@ctakes.apache.org
Subject: sentence detector newline behavior

The sentence detector always ends a sentence where there are newlines.
This is a problem for some notes (e.g. MIMIC radiology notes) where a
line can wrap in the middle of a sentence at fixed character offsets.
From the comments in SentenceDetector, the logic seems to be split up
very cleanly: it first runs the OpenNLP sentence detector, then breaks
any detected sentence wherever there is a newline. Questions:
1) Would it be good to add a boolean parameter for breaking on newlines?
2) If that step were removed or made optional, does the OpenNLP sentence
detector give good results with our model (a quick way to check is
sketched below)? Or is the model trained on text that always breaks at
carriage returns?
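
Something like the following would show what the model does on wrapped lines when the newline post-splitting is skipped. This is just a sketch using the OpenNLP 1.5-style API; the model path and the sample text are placeholders, not actual cTAKES resources.

// Sketch only: run the OpenNLP sentence detector by itself, without the
// post-hoc splitting at newlines, to see where the model puts boundaries.
import java.io.FileInputStream;
import java.io.InputStream;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class ModelOnlyCheck {
  public static void main(String[] args) throws Exception {
    // "sentence-model.bin" is a placeholder, not the actual model file name.
    InputStream in = new FileInputStream("sentence-model.bin");
    SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));
    in.close();
    // A made-up wrapped line, with the break falling right after "and".
    String wrapped = "The lungs are clear and\nwell expanded bilaterally.";
    for (String sentence : detector.sentDetect(wrapped)) {
      System.out.println("SENTENCE: " + sentence);
    }
  }
}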

Tim
