Ewan Mellor created CTAKES-520:
----------------------------------

             Summary: SentenceDetectorAnnotatorBIO token scanning performance issues
                 Key: CTAKES-520
                 URL: https://issues.apache.org/jira/browse/CTAKES-520
             Project: cTAKES
          Issue Type: Improvement
          Components: ctakes-core
    Affects Versions: 4.0.0
            Reporter: Ewan Mellor


SentenceDetectorAnnotatorBIO iterates over every character in the Segment and 
classifies it as Begin, Inside, or Outside a Sentence.  To do this, it needs 
to know the next and previous tokens relative to the current character.

It currently finds these tokens afresh for each character: it starts from the 
current character and scans forwards and backwards looking for whitespace 
until it finds the boundaries of the tokens on either side of the current 
position.  This is very wasteful: while the current index steps within a 
word, the tokens do not change, since we're still within the same word.  Also, 
since we only ever move forwards through the text, we never need to scan for 
the previous token, because we have already seen it.

(I found this bug with a pathological case where I had a "document" consisting 
of a single word that was a megabyte long.  When the word length is not 
bounded, the current algorithm is quadratic in the text length instead of 
linear, because it scans the length of the word for each character.)
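
To make the cost concrete, here is a minimal sketch that counts how many 
character steps a per-character rescan performs.  It is not the cTAKES code: 
the class and method names are hypothetical, and it assumes tokens are 
delimited purely by whitespace.

```java
public class NaiveScanCost {
    /**
     * Hypothetical model of the per-character rescan: from every index,
     * walk left and right until whitespace (or the text edge) is hit,
     * and count each step taken.
     */
    static long naiveSteps(String text) {
        long steps = 0;
        for (int i = 0; i < text.length(); i++) {
            // Scan backwards to the start of the surrounding word.
            int left = i;
            while (left > 0 && !Character.isWhitespace(text.charAt(left - 1))) {
                left--;
                steps++;
            }
            // Scan forwards to the end of the surrounding word.
            int right = i;
            while (right < text.length() && !Character.isWhitespace(text.charAt(right))) {
                right++;
                steps++;
            }
        }
        return steps;
    }

    /** Builds a single whitespace-free word of the given length. */
    static String singleWord(int length) {
        StringBuilder sb = new StringBuilder(length);
        for (int i = 0; i < length; i++) {
            sb.append('a');
        }
        return sb.toString();
    }
}
```

In this model, a single word of n characters costs n steps at every index, 
so doubling the word length quadruples the total work.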

Patch attached.  This fixes the problem by keeping track of the word boundary, 
and only scanning for the next token when we have reached the boundary of the 
current one.  Also, the previous token is simply taken as the token from the 
previous iteration.
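
The idea can be sketched roughly as follows.  This is an illustration under 
my own assumptions rather than the attached patch: the class and method names 
are hypothetical, tokens are assumed to be whitespace-delimited, and the 
"next" token at a whitespace position is taken to be the following word.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenScanSketch {
    /**
     * For each character index, returns {previousToken, nextToken}.
     * The forward scan runs only when the index reaches the end of the
     * cached token, so each character is scanned a bounded number of times.
     */
    static List<String[]> contexts(String text) {
        List<String[]> out = new ArrayList<>();
        String prev = "";   // token seen before the cached one
        String cur = "";    // cached token at or after the current index
        int curEnd = 0;     // exclusive end offset of the cached token
        for (int i = 0; i < text.length(); i++) {
            if (i >= curEnd) {
                // Reached the word boundary: skip whitespace, then find
                // the extent of the following token.
                int start = i;
                while (start < text.length() && Character.isWhitespace(text.charAt(start))) {
                    start++;
                }
                int end = start;
                while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
                    end++;
                }
                // The previous token is simply the one cached until now.
                if (!cur.isEmpty()) {
                    prev = cur;
                }
                cur = text.substring(start, end);
                curEnd = end;
            }
            out.add(new String[] {prev, cur});
        }
        return out;
    }
}
```

For example, calling contexts("ab cd") yields one entry per character: the 
first two have an empty previous token and "ab" as the next token, and the 
remaining three have "ab" as the previous token and "cd" as the next.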



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
