Thanks for the replies, both of you.  I have filed
https://issues.apache.org/jira/browse/CTAKES-507 with a patch to put
those details in the class comments.

I'm using double-newline as a paragraph separator, with mid-sentence single
newlines (the text is coming from an OCR phase so it's broken into lines).
I'm getting reasonable results now, by adding a custom annotator first that
adds Segment annotations for each of the paragraphs, and then using
SentenceDetectorAnnotatorBIO.  I then don't need ParagraphSentenceFixer
because my sentences can't span paragraphs as SentenceDetectorAnnotatorBIO
only sees one paragraph at a time.
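
For anyone wanting to try the same thing, the offset arithmetic behind the
paragraph step can be sketched roughly as below. This is Python rather than
the Java a real UIMA annotator would be written in, and `paragraph_spans` and
the separator pattern are illustrative assumptions, not anything that exists
in cTAKES. A real annotator would create one Segment per (begin, end) pair.

```python
import re

# Assumption: a paragraph break is a blank line, i.e. two newlines with
# optionally only spaces or tabs between them. Adjust for your OCR output.
PARAGRAPH_BREAK = re.compile(r"\n[ \t]*\n\s*")


def paragraph_spans(text):
    """Return (begin, end) character offsets for each non-empty paragraph."""
    spans = []
    start = 0
    for match in PARAGRAPH_BREAK.finditer(text):
        # Skip empty paragraphs, e.g. when the text starts with a blank line.
        if match.start() > start:
            spans.append((start, match.start()))
        start = match.end()
    # Whatever follows the last break is the final paragraph.
    if start < len(text):
        spans.append((start, len(text)))
    return spans
```

Single newlines inside a paragraph are left alone, so the sentence detector
still has to deal with them, but it never sees text from two paragraphs at once.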

I did find one place where it failed -- using the default tokenCounts.txt
it's breaking sentences at "Dr." which is a bit sad given the subject matter.
I'm hacking around that for now but if it proves to be an issue I may try
finding a tagged corpus to retrain with (I don't have a decent tagged
corpus myself, which is why I'm using the default models).
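
One possible shape for that kind of workaround, as a sketch only: merge any
detected sentence that ends in a title abbreviation with the sentence after
it. The function name, the abbreviation list, and the use of plain (begin,
end) offsets are all assumptions made for illustration, and the code is
Python rather than a cTAKES annotator.

```python
# Illustrative list only; extend it for the titles that occur in your notes.
TITLE_ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ms.", "Prof.")


def merge_after_titles(text, spans):
    """Merge each (begin, end) sentence span that ends in a title with the next span."""
    merged = []
    for begin, end in spans:
        if merged:
            prev_begin, prev_end = merged[-1]
            # Last whitespace-separated token of the previous sentence, if any.
            last_word = text[prev_begin:prev_end].split()[-1:]
            if last_word and last_word[0] in TITLE_ABBREVIATIONS:
                # The previous "sentence" stopped at a title, so extend it.
                merged[-1] = (prev_begin, end)
                continue
        merged.append((begin, end))
    return merged
```

This will wrongly merge in the rare case where a sentence really does end
with "Dr.", which is the trade-off a retrained model would avoid.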

Thanks,

Ewan.

On Fri, Apr 06, 2018 at 02:06:11PM +0000, Finan, Sean wrote:

> Hi Ewan,
> 
> We use Tim's SentenceDetectorAnnotatorBIO in a project that has run hundreds 
> of notes that contain newline-spanning sentences and it does work very well.  
> As Tim wrote, it does sometimes lump lines together if they aren't prose.  
> One example is lines of text in a list.  There are a few ctakes annotators 
> that can help correct this: ParagraphSentenceFixer and ListSentenceFixer.  
> 
> If you are running the default clinical pipeline, you can use the 
> SectionedFastPipeline.piper in ctakes-clinical-pipeline-res instead of the 
> DefaultFastPipeline.piper.
> The difference between the two is that SectionedFastPipeline loads 
> FullTokenizerPipeline.piper in ctakes-core-res instead of 
> DefaultTokenizerPipeline.piper.  The FullTokenizer... has a reference to the 
> Sentence...BIO, and you can enable it by swapping comment specifiers to:
> 
> // The sentence detector needs our custom model path, otherwise default values are used.
> addLogged SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar
> 
> // The SentenceDetectorAnnotatorBIO is a "lumper" that works well for notes in which end of line does not indicate a sentence.
> // If that is not your case, then you may get better results using the more standard SentenceDetector
> //add SentenceDetector
> 
> Sean
> 
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu] 
> Sent: Friday, April 06, 2018 9:46 AM
> To: dev@ctakes.apache.org
> Subject: Re: SentenceDetector [EXTERNAL] [SUSPICIOUS]
> 
> The changes were mainly meant to adapt the OpenNLP model to idiosyncrasies of 
> clinical text, but you're right that they have some shortcomings.
> 
> The newline behavior comes from the data sources originally used to build 
> the model: there were frequent cases of headings/sentence fragments by 
> themselves on a line, and _no_ cases of mid-sentence newlines. That, 
> combined with the fact that OpenNLP's train file format (at the time) itself 
> used newlines as a separator, led to the creation of that simple rule rather 
> than trying to retrain with newline as a candidate sentence splitter. I 
> created a different training file format and annotator that does what you 
> suggest, and built an alternative sentence splitter model, here:
> org/apache/ctakes/core/ae/SentenceDetectorAnnotatorBIO.java
> 
> It operates at the character level and splits a document into sentences. For 
> some people it works better. For data where there are potentially 
> mid-sentence newlines (like MIMIC), it is probably the only model with usable 
> results. Its typical failure mode is to lump two sentences together, while 
> the default annotator does the opposite.
> 
> Tim
> 
> 
> On Fri, 2018-04-06 at 02:11 +0000, Ewan Mellor wrote:
> > I'm looking at SentenceDetector from ctakes-core.  It has a surprising 
> > idea of what counts as a "sentence".  Before I delve any deeper, I 
> > wanted to ask whether there is a reason for what it's doing, in 
> > particular whether there's anything in the clinical pipeline that's 
> > depending on its behavior specifically.
> > 
> > The main problem I have is that it's splitting on characters like 
> > colon and semicolon, which aren't usually considered sentence 
> > separators, with the result that it often ends up tagging phrases 
> > rather than whole sentences.
> > 
> > It's using SentenceDetectorCtakes and EndOfSentenceScannerImpl, which 
> > seem to be derived from equivalents in OpenNLP, but with changes that 
> > I can't track (they date from the original edu.mayo import as far as I 
> > can tell).
> > Other than the additional separator characters, I can't tell whether 
> > these classes are doing anything important that you wouldn't equally 
> > get from OpenNLP's SentenceDetectorME, so I don't know why they're 
> > being used.
> > 
> > SentenceDetector is also splitting on newlines after passing the text 
> > through the max entropy sentence model.  I don't see the point in this 
> > -- if you're going to split on newlines anyway, then why not do that 
> > before passing through the entropy model?  Or just have newline as one 
> > of the potential EOS characters and treat it as a possible break point 
> > rather than a definite one?
> > 
> > Any insight would be welcome.
> > 
> > Thanks,
> > 
> > Ewan.
