In clinical narrative there are many sections that are enumerations, where a 
newline character must be treated as a sentence break. One example is Current 
Medications, in which each line contains a medication and its signature.

The format of the MIMIC notes is a bit strange: there are many newline 
characters in the middle of sentences. These are imposed by the native 
application the notes were created in (I cannot remember its name), which has 
a character window and inserts a newline at the end of that window. I believe 
we have a pre-processing script that deals with this issue.
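
As a rough illustration only (this is not the pre-processing script mentioned 
above, which is not shown in this thread), un-wrapping could be sketched as 
replacing every isolated newline with a space while leaving blank lines alone. 
The class and method names are hypothetical, and the heuristic would also join 
enumeration lines such as Current Medications, so it would only be safe on 
free-text sections:

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of cTAKES.
class LineUnwrapper {
    // A newline that is neither preceded nor followed by another newline.
    // Assumption: line endings are plain "\n" (no "\r\n").
    private static final Pattern SINGLE_NEWLINE = Pattern.compile("(?<!\\n)\\n(?!\\n)");

    // Joins hard-wrapped lines with a space; blank-line paragraph breaks are kept.
    static String unwrap(String text) {
        return SINGLE_NEWLINE.matcher(text).replaceAll(" ");
    }
}
```

With this sketch, "chest pain radiating\nto the left arm.\n\nNo fever." becomes 
"chest pain radiating to the left arm.\n\nNo fever."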
--Guergana

-----Original Message-----
From: Steven Bethard [mailto:steven.beth...@colorado.edu] 
Sent: Tuesday, May 21, 2013 9:59 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On May 21, 2013, at 6:07 AM, "Miller, Timothy" 
<timothy.mil...@childrens.harvard.edu> wrote:
> The sentence detector always ends a sentence where there are newlines.
> This is a problem for some notes (e.g. MIMIC radiology notes) where a 
> line can wrap in the middle of a sentence at specified character 
> offsets. In the comments for SentenceDetector, it seems to be split up 
> very logically in that it first runs the opennlp sentence detector, 
> then breaks any detected sentence wherever there is a newline. Questions:
> 1) Would it be good to add a boolean parameter for breaking on newlines?
> 2) If that section was removed/avoided, does the opennlp sentence 
> detector give good results given our model? Or is the model trained on 
> text that always breaks at carriage returns?

For what it's worth, in the ClearTK wrapper for the OpenNLP sentence detector, 
we only add extra sentences when there are *multiple* newlines in a row, i.e. 
"\\s*\\n\\s*\\n\\s*".
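
To make the effect of that pattern concrete, here is a minimal standalone 
sketch using only java.util.regex. It is not the ClearTK code itself; the class 
and method names are made up for illustration. It splits text only where two 
or more newlines occur in a row, so a single wrapped line stays in one chunk:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not the ClearTK wrapper.
class BlankLineSplitter {
    // The pattern quoted above: at least two newlines, with optional
    // whitespace before, between, and after them.
    private static final Pattern MULTIPLE_NEWLINES =
            Pattern.compile("\\s*\\n\\s*\\n\\s*");

    // Returns the non-empty chunks of text separated by blank lines.
    static List<String> split(String text) {
        List<String> chunks = new ArrayList<>();
        for (String chunk : MULTIPLE_NEWLINES.split(text)) {
            if (!chunk.isEmpty()) {
                chunks.add(chunk);
            }
        }
        return chunks;
    }
}
```

For example, BlankLineSplitter.split("First line\nstill first.\n\nSecond.") 
yields two chunks, and the single newline inside the first chunk is left as is.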

And it certainly seems like a good idea to me to have some way of disabling the 
"every newline is the end of a sentence" behavior. That seems like a 
particularly bad default behavior for most real text.
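
One way such a switch could look, as a sketch under assumptions: the flag name 
breakOnNewlines and the class below are hypothetical and are not taken from 
the cTAKES SentenceDetector. The input stands in for sentences already 
produced by an upstream detector; the extra split at newlines runs only when 
the flag is set:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical post-processing step; names are illustrative only.
class NewlineSplitOption {
    private final boolean breakOnNewlines;

    NewlineSplitOption(boolean breakOnNewlines) {
        this.breakOnNewlines = breakOnNewlines;
    }

    // Takes sentences from an upstream detector and, if enabled,
    // breaks each one further at every newline.
    List<String> apply(List<String> sentences) {
        if (!breakOnNewlines) {
            return new ArrayList<>(sentences);
        }
        List<String> out = new ArrayList<>();
        for (String sentence : sentences) {
            for (String piece : sentence.split("\\n")) {
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    out.add(trimmed);
                }
            }
        }
        return out;
    }
}
```

With the flag off, a wrapped sentence such as "The heart is normal\nin size." 
passes through unchanged; with it on, it is split into two pieces.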

Steve
