In clinical narrative there are many sections that are enumerations, where a newline character must be treated as a sentence break. One example is Current Medications, in which each line contains a medication and its signature.
The format of the MIMIC notes is a bit strange: there are many newline characters in the middle of sentences. This is imposed by the native application the notes were created in (I cannot remember the name of the app), which has a character window and inserts a newline at the end of that window. I believe we have a pre-processing script that deals with this issue.

--Guergana

-----Original Message-----
From: Steven Bethard [mailto:steven.beth...@colorado.edu]
Sent: Tuesday, May 21, 2013 9:59 AM
To: dev@ctakes.apache.org
Subject: Re: sentence detector newline behavior

On May 21, 2013, at 6:07 AM, "Miller, Timothy" <timothy.mil...@childrens.harvard.edu> wrote:

> The sentence detector always ends a sentence where there are newlines.
> This is a problem for some notes (e.g. MIMIC radiology notes) where a
> line can wrap in the middle of a sentence at specified character
> offsets. In the comments for SentenceDetector, it seems to be split up
> very logically in that it first runs the opennlp sentence detector,
> then breaks any detected sentence wherever there is a newline. Questions:
> 1) Would it be good to add a boolean parameter for breaking on newlines?
> 2) If that section was removed/avoided, does the opennlp sentence
> detector give good results given our model? Or is the model trained on
> text that always breaks at carriage returns?

For what it's worth, in the ClearTK wrapper for the OpenNLP sentence detector, we only add extra sentences when there are *multiple* newlines in a row, i.e. "\\s*\\n\\s*\\n\\s*".

And it certainly seems like a good idea to me to have some way of disabling the "every newline is the end of a sentence" behavior. That seems like a particularly bad default behavior for most real text.

Steve
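
To make the two behaviors in this thread concrete, here is a minimal Python sketch of the newline step only. It is not the cTAKES or ClearTK code, and it does not run the OpenNLP detector. The function name `split_on_newlines` and the `every_newline` flag are hypothetical, standing in for the boolean parameter proposed above. The sample texts are made up for illustration. The blank-line pattern is the one quoted in the thread, written as a Python raw string.

```python
import re

# Pattern quoted in the thread: a blank line (two newlines, possibly with
# whitespace around them). In Java source it is written with doubled backslashes.
BLANK_LINE = re.compile(r"\s*\n\s*\n\s*")


def split_on_newlines(text, every_newline=True):
    """Split a span of text into segments at newlines.

    every_newline=True  : every newline ends a segment (the behavior
                          described in the thread as the current default).
    every_newline=False : only blank lines end a segment; a single newline
                          is assumed to be a line wrap and becomes a space.
    Both the function and the flag are hypothetical names for this sketch.
    """
    if every_newline:
        parts = text.split("\n")
    else:
        parts = [re.sub(r"\s*\n\s*", " ", part) for part in BLANK_LINE.split(text)]
    # Drop empty segments and surrounding whitespace.
    return [part.strip() for part in parts if part.strip()]


# Made-up examples: a hard-wrapped sentence and a medication list.
wrapped = "The lungs are clear without\nfocal consolidation.\n\nNo pleural effusion."
med_list = "Current Medications:\naspirin 81 mg daily\nlisinopril 10 mg daily"

wrapped_strict = split_on_newlines(wrapped, every_newline=True)
wrapped_lenient = split_on_newlines(wrapped, every_newline=False)
meds_strict = split_on_newlines(med_list, every_newline=True)
meds_lenient = split_on_newlines(med_list, every_newline=False)
```

With these inputs, the strict setting cuts the wrapped sentence in two but keeps each medication on its own line, while the lenient setting repairs the wrapped sentence but merges the medication list into one segment. That trade-off is why a configurable flag, or section-aware handling, seems preferable to a single fixed default.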