I think the whole reason to use a machine learning approach for sentence
detection should be to help weigh evidence in the cases where hard
rules cause problems, mainly 1) when a period does not end a sentence,
but also 2) when a newline does or does not mean end of sentence. In
your example it is of course bad that, if you don't put in a sentence
break, you will think that "extravascular findings" is negated. But it
is also bad if you put a sentence break immediately after the word "and"
at the end of a line, because then your language model will think that
"and <eos>" is a good bigram.
I will create a JIRA issue for the parameter, try to implement it, and
see whether it gets OK results with the existing model.
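A minimal uimaFIT-style sketch of what I have in mind (the parameter
name and default here are placeholders, not actual cTakes code):

import org.apache.uima.analysis_engine.AnalysisEngineProcessException;
import org.apache.uima.fit.component.JCasAnnotator_ImplBase;
import org.apache.uima.fit.descriptor.ConfigurationParameter;
import org.apache.uima.jcas.JCas;

// Hypothetical sketch of a switch for the newline post-split.
public class SentenceDetectorSketch extends JCasAnnotator_ImplBase {

  public static final String PARAM_SPLIT_ON_NEWLINES = "splitOnNewlines";

  @ConfigurationParameter(
      name = PARAM_SPLIT_ON_NEWLINES,
      mandatory = false,
      defaultValue = "true", // keep the current behavior by default
      description = "Force a sentence break at every newline")
  private boolean splitOnNewlines;

  @Override
  public void process(JCas jCas) throws AnalysisEngineProcessException {
    // run the OpenNLP detector as now; apply the newline
    // post-split only when splitOnNewlines is true
  }
}

Defaulting to true would keep existing pipelines unchanged.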
Tim
On 05/21/2013 10:11 AM, Masanz, James J. wrote:
+1 for adding a boolean parameter, or perhaps instead a list of section
IDs (a sketch of the latter is below).
The sentence detector model was trained on data that always breaks at carriage
returns.
Breaking at line endings is important for text that is a list, something like this:
Heart Rate: normal
ENT: negative
EXTRAVASCULAR FINDINGS: Severe prostatic enlargement.
And without breaking on the line endings, the word "negative" would
negate "extravascular findings".
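If the list-of-section-IDs route seems better, a hypothetical
declaration (all names here are made up) might look like:

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.uima.fit.descriptor.ConfigurationParameter;

// Hypothetical: suppress the newline split only inside listed sections.
public class SectionListSketch {

  public static final String PARAM_SKIP_SECTIONS = "sectionsWhereNewlinesDontBreak";

  @ConfigurationParameter(
      name = PARAM_SKIP_SECTIONS,
      mandatory = false,
      description = "Section IDs in which a newline should NOT force a sentence break")
  private String[] skipSections = new String[0];

  boolean shouldSplitOnNewlines(String sectionId) {
    Set<String> skip = new HashSet<String>(Arrays.asList(skipSections));
    return !skip.contains(sectionId);
  }
}

That way list-style sections like the one above keep the hard break,
while free-text sections could opt out.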
-----Original Message-----
From: dev-return-1605-Masanz.James=mayo....@ctakes.apache.org
[mailto:dev-return-1605-Masanz.James=mayo....@ctakes.apache.org] On Behalf Of
Miller, Timothy
Sent: Tuesday, May 21, 2013 7:07 AM
To: dev@ctakes.apache.org
Subject: sentence detector newline behavior
The sentence detector always ends a sentence where there are newlines.
This is a problem for some notes (e.g. MIMIC radiology notes) where a
line can wrap in the middle of a sentence at fixed character
offsets. Judging from the comments in SentenceDetector, the code is
split up very logically: it first runs the OpenNLP sentence detector,
then breaks any detected sentence wherever there is a newline (a rough
sketch of that flow is below the questions). Questions:
1) Would it be good to add a boolean parameter for breaking on newlines?
2) If that newline-splitting step were removed/avoided, does the OpenNLP
sentence detector give good results with our model? Or is the model
trained on text that always breaks at carriage returns?
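To make the current two-stage behavior concrete, here is a rough sketch
assuming a plain OpenNLP 1.5-style API; this is an illustration, not the
actual SentenceDetector code:

import java.io.FileInputStream;
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import opennlp.tools.sentdetect.SentenceDetectorME;
import opennlp.tools.sentdetect.SentenceModel;

public class TwoStageSplitDemo {
  public static void main(String[] args) throws Exception {
    InputStream in = new FileInputStream(args[0]); // path to a sentence model
    SentenceDetectorME detector = new SentenceDetectorME(new SentenceModel(in));
    in.close();
    String text = "Severe prostatic\nenlargement noted on exam.";
    List<String> sentences = new ArrayList<String>();
    for (String sent : detector.sentDetect(text)) { // stage 1: statistical detector
      for (String piece : sent.split("\\r?\\n")) {  // stage 2: hard break at newlines
        if (piece.trim().length() > 0) {
          sentences.add(piece.trim());
        }
      }
    }
    // With stage 2, the wrapped line comes out as two "sentences":
    // [Severe prostatic, enlargement noted on exam.]
    System.out.println(sentences);
  }
}

Question 1 above amounts to making stage 2 optional.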
Tim