[ https://issues.apache.org/jira/browse/CTAKES-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ewan Mellor updated CTAKES-520:
-------------------------------
    Attachment:     (was: CTAKES-520.patch)

> SentenceDetectorAnnotatorBIO token scanning performance issues
> --------------------------------------------------------------
>
>                 Key: CTAKES-520
>                 URL: https://issues.apache.org/jira/browse/CTAKES-520
>             Project: cTAKES
>          Issue Type: Improvement
>          Components: ctakes-core
>    Affects Versions: 4.0.0
>            Reporter: Ewan Mellor
>            Priority: Minor
>         Attachments: CTAKES-520.patch
>
>
> SentenceDetectorAnnotatorBIO iterates over every character in the Segment and
> classifies it as Begin, Inside, or Outside a Sentence. When doing this, it
> needs to know the next and previous token from the current character.
> It currently finds these tokens afresh for each character. That means that
> it starts from the current character, and scans forward and backwards looking
> for whitespace until it finds the boundaries of the tokens either side of the
> current position. This is very wasteful; when the current index steps within
> a word, the tokens do not change since we're still within the same word.
> Also, since we're scanning in one direction, we never need to scan for the
> previous token, because we already know it.
> (I found this bug with a pathological case where I had a "document" with a
> single word that was a megabyte long. In a case where the word length is not
> bounded, the current algorithm is quadratic instead of linear, because it
> scans the length of the word for each character.)
> Patch attached. This fixes the problem by keeping track of the word
> boundary, and only scanning for the next token when we have reached the
> boundary of the current one. Also, the previous token is simply taken as the
> token from the previous iteration, and the token features are only recomputed
> when the token changes.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
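
For readers without the attachment, the idea described in the issue can be
sketched roughly as follows. This is not the attached patch and not the
cTAKES code: the class TokenScanSketch and the methods naiveAt and cachedScan
are hypothetical names made up for illustration, and the token semantics
(whitespace-delimited runs, with a whitespace position mapped to the token
that follows it) are an assumption. naiveAt rescans in both directions for a
single index, as the issue describes; cachedScan only rescans when the index
passes the end of the current token and reuses the token from the previous
iteration as the previous token.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; not the cTAKES classes or the attached patch. */
final class TokenScanSketch {

    private static boolean isSpace(String text, int i) {
        return Character.isWhitespace(text.charAt(i));
    }

    /** Naive lookup for one index: scans backward and forward every time.
     *  Returns {previousToken, currentToken}; either may be empty. */
    static String[] naiveAt(String text, int i) {
        int n = text.length();
        int start = i;
        if (isSpace(text, i)) {
            // On whitespace: the current token is the next one to the right.
            while (start < n && isSpace(text, start)) start++;
        } else {
            // Inside a word: walk back to the start of that word.
            while (start > 0 && !isSpace(text, start - 1)) start--;
        }
        int end = start;
        while (end < n && !isSpace(text, end)) end++;
        // Walk back over whitespace, then over the preceding word.
        int prevEnd = start;
        while (prevEnd > 0 && isSpace(text, prevEnd - 1)) prevEnd--;
        int prevStart = prevEnd;
        while (prevStart > 0 && !isSpace(text, prevStart - 1)) prevStart--;
        return new String[] {
            text.substring(prevStart, prevEnd), text.substring(start, end)
        };
    }

    /** Single left-to-right pass: rescans only at a token boundary. */
    static List<String[]> cachedScan(String text) {
        int n = text.length();
        List<String[]> result = new ArrayList<>(n);
        String prev = "";
        String cur = "";
        int curEnd = 0; // exclusive end of the current token
        String[] pair = null; // shared while the tokens are unchanged
        for (int i = 0; i < n; i++) {
            if (i >= curEnd) {
                // Passed the boundary: find the next token to the right.
                int start = i;
                while (start < n && isSpace(text, start)) start++;
                int end = start;
                while (end < n && !isSpace(text, end)) end++;
                // The old current token becomes the previous token.
                if (!cur.isEmpty()) prev = cur;
                cur = text.substring(start, end);
                curEnd = end;
                pair = new String[] { prev, cur };
            }
            result.add(pair);
        }
        return result;
    }
}
```

In this sketch each character is visited by the forward scan at most once
across the whole pass, so the work stays linear in the text length even for
a single very long word, whereas calling naiveAt for every index rescans the
whole word each time.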