On May 21, 2013, at 6:07 AM, "Miller, Timothy" 
<timothy.mil...@childrens.harvard.edu> wrote:
> The sentence detector always ends a sentence where there are newlines.
> This is a problem for some notes (e.g. MIMIC radiology notes) where a
> line can wrap in the middle of a sentence at specified character
> offsets. In the comments for SentenceDetector, it seems to be split up
> very logically in that it first runs the opennlp sentence detector, then
> breaks any detected sentence wherever there is a newline. Questions:
> 1) Would it be good to add a boolean parameter for breaking on newlines?
> 2) If that section was removed/avoided, does the opennlp sentence
> detector give good results given our model? Or is the model trained on
> text that always breaks at carriage returns?

For what it's worth, in the ClearTK wrapper for the OpenNLP sentence detector, 
we only add extra sentences when there are *multiple* newlines in a row, i.e. 
"\\s*\\n\\s*\\n\\s*".
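
To illustrate the difference, here is a minimal sketch of that idea. It is not 
the actual ClearTK code: the class name ParagraphBreakSplitter and the method 
splitOnBlankLines are hypothetical, and it works on plain strings rather than 
on annotations with offsets. Only the regular expression is taken from the 
description above. A single newline (a wrapped line) leaves the sentence 
intact, while a blank line splits it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class ParagraphBreakSplitter {
    // The pattern quoted above: two newlines in a row, optionally padded
    // by other whitespace. A lone newline does not match.
    private static final Pattern MULTI_NEWLINE =
            Pattern.compile("\\s*\\n\\s*\\n\\s*");

    /**
     * Hypothetical helper: splits text that a sentence detector returned
     * as one sentence, but only where a blank line occurs.
     */
    public static List<String> splitOnBlankLines(String sentence) {
        List<String> parts = new ArrayList<>();
        for (String piece : MULTI_NEWLINE.split(sentence)) {
            String trimmed = piece.trim();
            // Skip empty pieces, e.g. when the text starts with blank lines.
            if (!trimmed.isEmpty()) {
                parts.add(trimmed);
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // A line wrapped mid-sentence: stays one sentence.
        String wrapped = "The lungs are clear\nwithout focal consolidation.";
        // A heading followed by a blank line: becomes two pieces.
        String withBlankLine = "FINDINGS\n\nThe lungs are clear.";

        System.out.println(splitOnBlankLines(wrapped).size());
        System.out.println(splitOnBlankLines(withBlankLine).size());
    }
}
```

A real annotator would need to keep character offsets for each piece instead of 
returning trimmed strings, so treat this only as a picture of the splitting 
rule.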

And it certainly seems like a good idea to me to have some way of disabling the 
"every newline is the end of a sentence" behavior. That seems like a 
particularly bad default behavior for most real text.

Steve
