Ewan Mellor created CTAKES-520:
----------------------------------

             Summary: SentenceDetectorAnnotatorBIO token scanning performance issues
                 Key: CTAKES-520
                 URL: https://issues.apache.org/jira/browse/CTAKES-520
             Project: cTAKES
          Issue Type: Improvement
          Components: ctakes-core
    Affects Versions: 4.0.0
            Reporter: Ewan Mellor

SentenceDetectorAnnotatorBIO iterates over every character in the Segment and classifies it as Begin, Inside, or Outside a Sentence. To do this, it needs to know the next and previous tokens relative to the current character.

It currently finds these tokens afresh for each character: it starts from the current character and scans forward and backward for whitespace until it finds the boundaries of the tokens on either side of the current position. This is very wasteful. While the current index steps within a word, the tokens do not change, since we're still within the same word. Also, since we're scanning in one direction, we never need to scan for the previous token, because we already know it.

(I found this bug with a pathological case where I had a "document" consisting of a single word that was a megabyte long. When the word length is not bounded, the current algorithm is quadratic instead of linear, because it scans the length of the word for each character.)

Patch attached. This fixes the problem by keeping track of the word boundary and only scanning for the next token when we have reached the boundary of the current one. The previous token is simply the token carried over from the previous iteration.
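
For illustration, the boundary-tracking idea can be sketched as below. This is not the attached patch and does not use the cTAKES API: the names TokenScanSketch, Context and scan are hypothetical, and the sketch assumes that tokens are whitespace-delimited runs and that the "next" token is the one containing or following the current character. The forward scan runs only when the index crosses the cached token boundary, so the total work should stay linear in the text length.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenScanSketch {

    /** Previous/next token as seen from one character position ("" if none). */
    public static final class Context {
        public final String prev;
        public final String next;

        Context(String prev, String next) {
            this.prev = prev;
            this.next = next;
        }
    }

    /**
     * Returns one Context per character of text. Assumption: a token is a
     * maximal run of non-whitespace characters.
     */
    public static List<Context> scan(String text) {
        List<Context> out = new ArrayList<>(text.length());
        String prev = "";
        String next = "";
        int nextEnd = 0; // exclusive end offset of the cached next token

        for (int i = 0; i < text.length(); i++) {
            if (i >= nextEnd) {
                // We have reached the boundary of the cached token: the old
                // next token becomes the previous one, with no backward scan.
                prev = next;

                // Skip whitespace, then scan forward to the end of the word.
                int start = i;
                while (start < text.length()
                        && Character.isWhitespace(text.charAt(start))) {
                    start++;
                }
                int end = start;
                while (end < text.length()
                        && !Character.isWhitespace(text.charAt(end))) {
                    end++;
                }
                next = text.substring(start, end);
                nextEnd = end;
            }
            // Within the same word the cached tokens are reused unchanged.
            out.add(new Context(prev, next));
        }
        return out;
    }
}
```

With this sketch, scan("ab cd") reports no previous token and "ab" as the next token for the first two characters, then "ab" and "cd" for the remaining three. A single very long word triggers exactly one forward scan.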