Thanks for the replies, both of you. I have filed https://issues.apache.org/jira/browse/CTAKES-507 with a patch to put those details in the class comments.
I'm using double-newline as a paragraph separator, with mid-sentence single newlines (the text is coming from an OCR phase so it's broken into lines). I'm getting reasonable results now, by adding a custom annotator first that adds Segment annotations for each of the paragraphs, and then using SentenceDetectorAnnotatorBIO. I then don't need ParagraphSentenceFixer because my sentences can't span paragraphs as SentenceDetectorAnnotatorBIO only sees one paragraph at a time.

I did find one place where it failed -- using the default tokenCounts.txt it's breaking sentences at "Dr." which is a bit sad given the subject matter. I'm hacking around that for now but if it proves to be an issue I may try finding a tagged corpus to retrain with (I don't have a decent tagged corpus myself, which is why I'm using the default models).

Thanks,

Ewan.

On Fri, Apr 06, 2018 at 02:06:11PM +0000, Finan, Sean wrote:
> Hi Ewan,
>
> We use Tim's SentenceDetectorAnnotatorBIO in a project that has run hundreds
> of notes that contain newline-spanning sentences and it does work very well.
> As Tim wrote, it does sometimes lump lines together if they aren't prose.
> One example is lines of text in a list. There are a few ctakes annotators
> that can help correct this: ParagraphSentenceFixer and ListSentenceFixer.
>
> If you are running the default clinical pipeline, you can use the
> SectionedFastPipeline.piper in ctakes-clinical-pipeline-res instead of the
> DefaultFastPipeline.piper
> The difference between the two is that SectionedFastPipeline loads
> FullTokenizerPipeline.piper in ctakes-core-res instead of
> DefaultTokenizerPipeline.piper. The FullTokenizer... has a reference to the
> Sentence...BIO, and you can enable it by swapping comment specifiers to:
>
> // The sentence detector needs our custom model path, otherwise default values are used.
> addLogged SentenceDetectorAnnotatorBIO classifierJarPath=/org/apache/ctakes/core/sentdetect/model.jar
>
> // The SentenceDetectorAnnotatorBIO is a "lumper" that works well for notes in which end of line does not indicate a sentence.
> // If that is not your case, then you may get better results using the more standard SentenceDetector
> //add SentenceDetector
>
> Sean
>
> -----Original Message-----
> From: Miller, Timothy [mailto:timothy.mil...@childrens.harvard.edu]
> Sent: Friday, April 06, 2018 9:46 AM
> To: dev@ctakes.apache.org
> Subject: Re: SentenceDetector [EXTERNAL] [SUSPICIOUS]
>
> The changes were mainly meant to adapt the OpenNLP model to idiosyncrasies of
> clinical text, but you're right that they have some shortcomings.
>
> The newline thing is that, in the data sources used originally to build the
> model, there were frequent cases of headings/sentence fragments by themselves
> on a line, and _no_ cases of mid-sentence newlines. That, combined with the
> fact that OpenNLP's train file format (at the time) itself used newlines as a
> separator, led to the creation of that simple rule rather than trying to
> retrain with newline as a candidate sentence splitter. I created a different
> training file format and annotator that does what you suggest, and built an
> alternative sentence splitter model, here:
> org/apache/ctakes/core/ae/SentenceDetectorAnnotatorBIO.java
>
> It operates at the character level and splits a document into sentences. For
> some people it works better. For data where there are potentially
> mid-sentence newlines (like MIMIC), it is probably the only model with usable
> results. Its typical failure mode is to lump two sentences together, while
> the default annotator does the opposite.
>
> Tim
>
>
> On Fri, 2018-04-06 at 02:11 +0000, Ewan Mellor wrote:
> > I'm looking at SentenceDetector from ctakes-core. It has a surprising
> > idea of what counts as a "sentence". Before I delve any deeper, I
> > wanted to ask whether there is a reason for what it's doing, in
> > particular whether there's anything in the clinical pipeline that's
> > depending on its behavior specifically.
> >
> > The main problem I have is that it's splitting on characters like
> > colon and semicolon, which aren't usually considered sentence
> > separators, with the result that it often ends up tagging phrases
> > rather than whole sentences.
> >
> > It's using SentenceDetectorCtakes and EndOfSentenceScannerImpl, which
> > seem to be derived from equivalents in OpenNLP, but with changes that
> > I can't track (they date from the original edu.mayo import as far as I
> > can tell).
> > Other than the additional separator characters, I can't tell whether
> > these classes are doing anything important that you wouldn't equally
> > get from OpenNLP's SentenceDetectorME, so I don't know why they're
> > being used.
> >
> > SentenceDetector is also splitting on newlines after passing the text
> > through the max entropy sentence model. I don't see the point in this
> > -- if you're going to split on newlines anyway, then why not do that
> > before passing through the entropy model? Or just have newline as one
> > of the potential EOS characters and treat it as a possible break point
> > rather than a definite one?
> >
> > Any insight would be welcome.
> >
> > Thanks,
> >
> > Ewan.
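
For anyone following the same route as the top reply in this thread (double-newline as a paragraph separator, single newlines mid-sentence), here is a minimal sketch of the offset computation such a custom annotator might start from. It is not the code actually used in the thread: the class name ParagraphSpans and the method find are hypothetical, and the UIMA/cTAKES wiring that would turn each span into a Segment annotation is deliberately left out. It assumes a paragraph break is a newline, optional whitespace, and another newline.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of cTAKES: finds paragraph character spans.
class ParagraphSpans {
    // Assumption: a paragraph ends at a newline followed by optional
    // whitespace and another newline (i.e. one or more blank lines).
    private static final Pattern BREAK = Pattern.compile("\\n\\s*\\n");

    /** Returns {begin, end} offsets (end exclusive) of each non-blank paragraph. */
    static List<int[]> find(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = BREAK.matcher(text);
        int start = 0;
        while (m.find()) {
            addIfNonBlank(text, start, m.start(), spans);
            start = m.end();
        }
        // The last paragraph has no trailing separator.
        addIfNonBlank(text, start, text.length(), spans);
        return spans;
    }

    private static void addIfNonBlank(String text, int begin, int end, List<int[]> spans) {
        if (!text.substring(begin, end).trim().isEmpty()) {
            spans.add(new int[] {begin, end});
        }
    }

    public static void main(String[] args) {
        // Single newlines stay inside a paragraph; the blank line splits it.
        String note = "Seen by Dr. Smith\nin clinic today.\n\nFollow up in\ntwo weeks.";
        for (int[] s : find(note)) {
            System.out.println(s[0] + "-" + s[1]);
        }
    }
}
```

In a real annotator each {begin, end} pair would presumably become one Segment covering that range, so that SentenceDetectorAnnotatorBIO only ever sees one paragraph at a time, as described above.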