The changes were mainly meant to adapt the OpenNLP model to
idiosyncrasies of clinical text, but you're right that they have some
shortcomings.

The newline behavior comes from the data sources originally used to
build the model: headings and sentence fragments frequently appeared by
themselves on a line, and there were _no_ cases of mid-sentence
newlines. That, combined with the fact that OpenNLP's training file
format (at the time) itself used newlines as a separator, led to that
simple rule rather than an attempt to retrain with newline as a
candidate sentence splitter. I later created a different training file
format and annotator that does what you suggest, and built an
alternative sentence splitter model, here:
org/apache/ctakes/core/ae/SentenceDetectorAnnotatorBIO.java
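
To make the rule concrete, here is a minimal sketch of the post-model
newline split as I described it above. This is an illustration of the
behavior, not the actual cTakes code; the class and method names are
hypothetical. The maxent model proposes sentence spans, then any span
containing a newline is split further at each newline.

```java
import java.util.ArrayList;
import java.util.List;

public class NewlineSplit {
    // Hypothetical sketch: take the sentences the maxent model proposed
    // and split each one again at every newline, dropping empty pieces.
    static List<String> splitOnNewlines(List<String> modelSentences) {
        List<String> out = new ArrayList<>();
        for (String s : modelSentences) {
            for (String piece : s.split("\\R")) {   // \R matches any line break
                String trimmed = piece.trim();
                if (!trimmed.isEmpty()) {
                    out.add(trimmed);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A heading alone on a line gets separated from the sentence below it.
        List<String> model = List.of("CHIEF COMPLAINT:\nChest pain.",
                                     "Patient denies fever.");
        System.out.println(splitOnNewlines(model));
    }
}
```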

It operates at the character level and splits a document into
sentences. For some people it works better. For data where there are
potentially mid-sentence newlines (like MIMIC), it is probably the only
model with usable results. Its typical failure mode is to lump two
sentences together, while the default annotator's is the opposite.
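
To illustrate the character-level idea, here is a sketch of BIO-style
decoding in general, not the actual SentenceDetectorAnnotatorBIO logic
(the real label scheme and decoding may differ): each character gets a
label, where B begins a sentence, I continues it, and O falls outside
any sentence, and contiguous B/I runs become sentence spans.

```java
import java.util.ArrayList;
import java.util.List;

public class BioDecode {
    // Hypothetical sketch: reconstruct sentences from per-character
    // B/I/O labels. A 'B' starts a new sentence (closing any open one),
    // 'I' extends the current sentence, 'O' closes it.
    static List<String> decode(String text, char[] labels) {
        List<String> sentences = new ArrayList<>();
        StringBuilder current = null;
        for (int i = 0; i < text.length(); i++) {
            char label = labels[i];
            if (label == 'B') {
                if (current != null) {
                    sentences.add(current.toString());
                }
                current = new StringBuilder();
                current.append(text.charAt(i));
            } else if (label == 'I' && current != null) {
                current.append(text.charAt(i));
            } else if (label == 'O' && current != null) {
                sentences.add(current.toString());
                current = null;
            }
        }
        if (current != null) {
            sentences.add(current.toString());
        }
        return sentences;
    }
}
```

Because the decision is per character, a newline simply gets a label
like any other character, so it can act as a possible break point
rather than a definite one.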

Tim


On Fri, 2018-04-06 at 02:11 +0000, Ewan Mellor wrote:
> I'm looking at SentenceDetector from ctakes-core.  It has a surprising
> idea of what counts as a "sentence".  Before I delve any deeper, I
> wanted to ask whether there is a reason for what it's doing, in
> particular whether there's anything in the clinical pipeline that's
> depending on its behavior specifically.
> 
> The main problem I have is that it's splitting on characters like
> colon and semicolon, which aren't usually considered sentence
> separators, with the result that it often ends up tagging phrases
> rather than whole sentences.
> 
> It's using SentenceDetectorCtakes and EndOfSentenceScannerImpl, which
> seem to be derived from equivalents in OpenNLP, but with changes that
> I can't track (they date from the original edu.mayo import as far as I
> can tell).  Other than the additional separator characters, I can't
> tell whether these classes are doing anything important that you
> wouldn't equally get from OpenNLP's SentenceDetectorME, so I don't
> know why they're being used.
> 
> SentenceDetector is also splitting on newlines after passing the text
> through the max entropy sentence model.  I don't see the point in this
> -- if you're going to split on newlines anyway, then why not do that
> before passing through the entropy model?  Or just have newline as one
> of the potential EOS characters and treat it as a possible break point
> rather than a definite one?
> 
> Any insight would be welcome.
> 
> Thanks,
> 
> Ewan.
