The OpenNLP sentence detection model used by cTAKES does not split sentences at newlines; additional logic in the cTAKES sentence splitter does that (an alternative implementation that doesn't is in the YTEX branch). AFAIK no retraining or change to the feature representation is necessary.
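
To make the idea concrete, here is a minimal sketch of that kind of post-processing step. It is not the cTAKES code: the class and method names are made up for illustration, and it assumes the detector has already returned sentence spans as {start, end} character offsets into the text.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of cTAKES or OpenNLP.
final class NewlinePostSplitter {

    /**
     * Splits each sentence span [start, end) at line breaks and returns the
     * trimmed, non-empty pieces as {start, end} offsets into the same text.
     */
    static List<int[]> splitAtNewlines(String text, List<int[]> sentences) {
        List<int[]> pieces = new ArrayList<>();
        for (int[] s : sentences) {
            int pieceStart = s[0];
            for (int i = s[0]; i <= s[1]; i++) {
                // Close the current piece at a line break or at the end of the span.
                if (i == s[1] || text.charAt(i) == '\n' || text.charAt(i) == '\r') {
                    addTrimmed(text, pieceStart, i, pieces);
                    pieceStart = i + 1;
                }
            }
        }
        return pieces;
    }

    // Strips surrounding whitespace from [start, end) and keeps the piece if anything is left.
    private static void addTrimmed(String text, int start, int end, List<int[]> out) {
        while (start < end && Character.isWhitespace(text.charAt(start))) start++;
        while (end > start && Character.isWhitespace(text.charAt(end - 1))) end--;
        if (start < end) out.add(new int[] {start, end});
    }

    public static void main(String[] args) {
        String text = "Chief complaint\nChest pain. No fever.";
        // Pretend the model returned two spans, the first one crossing the newline.
        List<int[]> detected = List.of(new int[] {0, 27}, new int[] {28, 37});
        for (int[] p : splitAtNewlines(text, detected)) {
            System.out.println(text.substring(p[0], p[1]));
        }
    }
}
```

Because the model is left alone and only its output is cut further, nothing has to be retrained for this to work.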
Vj

On Monday, January 20, 2014, Jörn Kottmann <kottm...@gmail.com> wrote:

> Hi all,
>
> currently I have quite a bit of time to work on OpenNLP, and would like to
> help you out with this issue.
>
> Here is the follow up issue for this change:
> https://issues.apache.org/jira/browse/OPENNLP-602
>
> I am still trying to figure out what would be the best option to implement
> this. In the training data a user could just use a special tag to identify
> the chars.
>
> Instead of <NEWLINE> it might be better to use <CR> and <LF> to encode
> these two chars in the training data. Any thoughts?
>
> I am planning to release this as part of OpenNLP 1.6.0.
>
> Thanks,
> Jörn
>
> On 05/22/2013 02:03 PM, Jörn Kottmann wrote:
>
>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>
>>> That's awesome! It might be worth trying at least. How does the training
>>> process change? Previously the training data would be one sentence per
>>> line, but with newlines as possible mid-sentence characters that could
>>> be trouble, is there a new representation for training data? Or would we
>>> have to use the training api?
>>>
>>
>> Good point, yes that will be a problem with the default training format,
>> but it shouldn't be hard to solve. In the format itself we could define
>> a new line tag e.g. <NEWLINE> to mark new lines.
>> as a hack to make it work with 1.5.3 you could instead use a special char
>> as a replacement for the new line char.
>> When you pass the text down to the sentence detector a simple string
>> replace could be used to convert all new line chars to the special new
>> line marker char.
>>
>> If things work out for you performance wise as well we will just
>> integrate it properly into OpenNLP for the next release.
>>
>> Could you produce a sentence detector training file with a new line
>> marker char?
>>
>> You should try to pick a char you can also pass in on a terminal
>> otherwise you have to use the API to train the model.
>> The build in cross validation could be used to evaluate the performance.
>>
>> Jörn
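
The string-replace workaround quoted above could look roughly like the sketch below. The two marker characters and the class name are assumptions chosen for illustration; any characters that never occur in the corpus and can be typed on a terminal would do. Using separate markers for CR and LF follows the later <CR>/<LF> suggestion, and replacing one character with one character keeps all offsets aligned with the original text.

```java
// Hypothetical helper, not part of OpenNLP.
final class NewlineMarker {

    // Assumed marker characters; they must not appear anywhere in the corpus.
    static final char CR_MARKER = '\u00A4'; // currency sign
    static final char LF_MARKER = '\u00B6'; // pilcrow sign

    /** Replaces CR and LF one-for-one, so character offsets are unchanged. */
    static String encode(String text) {
        return text.replace('\r', CR_MARKER).replace('\n', LF_MARKER);
    }

    /** Restores the original line breaks in a string produced by encode(). */
    static String decode(String text) {
        return text.replace(CR_MARKER, '\r').replace(LF_MARKER, '\n');
    }

    public static void main(String[] args) {
        String note = "Patient denies chest\npain. Follow up in 2 weeks.";
        String encoded = encode(note);
        // A model trained on marker-encoded data would be given `encoded`;
        // spans it returns also index into `note`, since the lengths match.
        System.out.println(encoded);
        System.out.println(decode(encoded).equals(note));
    }
}
```

The same encode step would be applied to each line of the training file, so that a sentence containing line breaks still fits the one-sentence-per-line format.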