On May 21, 2013, at 6:07 AM, "Miller, Timothy" <timothy.mil...@childrens.harvard.edu> wrote:

> The sentence detector always ends a sentence where there are newlines.
> This is a problem for some notes (e.g. MIMIC radiology notes) where a
> line can wrap in the middle of a sentence at specified character
> offsets. In the comments for SentenceDetector, it seems to be split up
> very logically in that it first runs the opennlp sentence detector, then
> breaks any detected sentence wherever there is a newline. Questions:
> 1) Would it be good to add a boolean parameter for breaking on newlines?
> 2) If that section was removed/avoided, does the opennlp sentence
> detector give good results given our model? Or is the model trained on
> text that always breaks at carriage returns?

For what it's worth, in the ClearTK wrapper for the OpenNLP sentence detector, we only add extra sentences when there are *multiple* newlines in a row, i.e. "\\s*\\n\\s*\\n\\s*".

And it certainly seems like a good idea to me to have some way of disabling the "every newline is the end of a sentence" behavior. That seems like a particularly bad default behavior for most real text.

Steve
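
To make the difference between the two rules concrete, here is a minimal sketch in plain Java. It is not the cTAKES or ClearTK code: the class name, the `split` helper, and the `breakOnEveryNewline` flag are hypothetical, and the single-newline pattern is an assumption about what "break on every newline" would look like as a regex. Only the multiple-newline pattern is taken from the message above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class NewlineSentenceSplit {
    // Pattern quoted in the thread: at least two newlines, with optional
    // whitespace around and between them.
    static final Pattern MULTI_NEWLINE = Pattern.compile("\\s*\\n\\s*\\n\\s*");

    // Assumed equivalent of "every newline ends a sentence" (hypothetical).
    static final Pattern ANY_NEWLINE = Pattern.compile("\\s*\\n\\s*");

    // Splits one already-detected sentence on newlines. The boolean flag is
    // a hypothetical stand-in for the parameter proposed in the thread.
    static List<String> split(String sentence, boolean breakOnEveryNewline) {
        Pattern pattern = breakOnEveryNewline ? ANY_NEWLINE : MULTI_NEWLINE;
        List<String> parts = new ArrayList<>();
        for (String piece : pattern.split(sentence)) {
            if (!piece.isEmpty()) {
                parts.add(piece); // skip empty pieces from leading separators
            }
        }
        return parts;
    }

    public static void main(String[] args) {
        // A line that wraps mid-sentence, followed by a blank line.
        String text = "The lungs are clear\nbilaterally.\n\nNo pleural effusion";
        System.out.println(split(text, true).size());  // 3 pieces
        System.out.println(split(text, false).size()); // 2 pieces
    }
}
```

With the flag set to false, the wrapped line "The lungs are clear\nbilaterally." stays in one piece and only the blank line produces a break, which is the behavior described for the ClearTK wrapper.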