[jira] [Updated] (CTAKES-520) SentenceDetectorAnnotatorBIO token scanning performance issues

Ewan Mellor (JIRA) Thu, 16 Aug 2018 15:30:40 -0700


     [ 
https://issues.apache.org/jira/browse/CTAKES-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ewan Mellor updated CTAKES-520:
-------------------------------
    Attachment:     (was: CTAKES-520.patch)

> SentenceDetectorAnnotatorBIO token scanning performance issues
> --------------------------------------------------------------
>
>                 Key: CTAKES-520
>                 URL: https://issues.apache.org/jira/browse/CTAKES-520
>             Project: cTAKES
>          Issue Type: Improvement
>          Components: ctakes-core
>    Affects Versions: 4.0.0
>            Reporter: Ewan Mellor
>            Priority: Minor
>         Attachments: CTAKES-520.patch
>
>
> SentenceDetectorAnnotatorBIO iterates over every character in the Segment and 
> classifies it as Begin, Inside, or Outside a Sentence.  When doing this, it 
> needs to know the next and previous token from the current character.
> It currently finds these tokens afresh for each character.  That means that 
> it starts from the current character, and scans forward and backwards looking 
> for whitespace until it finds the boundaries of the tokens either side of the 
> current position.  This is very wasteful; when the current index steps within 
> a word, the tokens do not change since we're still within the same word.  
> Also, since we're scanning in one direction, we never need to scan for the 
> previous token, because we already know it.
> (I found this bug with a pathological case where I had a "document" with a 
> single word that was a megabyte long.  When word length is unbounded, the 
> current algorithm is quadratic rather than linear in the input size, because 
> it rescans the full length of the word for each character.)
> Patch attached.  This fixes the problem by keeping track of the word 
> boundary, and only scanning for the next token when we have reached the 
> boundary of the current one.  Also, the previous token is simply taken as the 
> token from the previous iteration, and the token features are only recomputed 
> when the token changes.
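
For illustration, the boundary-tracking idea described above might look roughly
like the sketch below.  This is not the attached patch and not the cTAKES code:
the class TokenScanSketch, the method scan, and the choice of "current token"
(the whitespace-delimited word containing the index, or the next word if the
index is on whitespace) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and token semantics are assumptions, not cTAKES API.
public class TokenScanSketch {

    /**
     * For each character index, returns {previousToken, currentToken}.
     * The token strings are shared between entries, so the per-character
     * work and memory stay constant even for one very long word.
     */
    public static List<String[]> scan(String text) {
        List<String[]> out = new ArrayList<>(text.length());
        String prevToken = "";
        String curToken = "";
        int tokenEnd = 0; // exclusive end of the current token

        for (int i = 0; i < text.length(); i++) {
            // Only rescan once the index has reached the tracked boundary.
            if (i >= tokenEnd) {
                int start = i;
                while (start < text.length()
                        && Character.isWhitespace(text.charAt(start))) {
                    start++; // skip whitespace up to the next word
                }
                int end = start;
                while (end < text.length()
                        && !Character.isWhitespace(text.charAt(end))) {
                    end++; // walk to the end of that word
                }
                // The previous token is simply the one from the last rescan.
                prevToken = curToken;
                curToken = text.substring(start, end);
                tokenEnd = end;
            }
            out.add(new String[] {prevToken, curToken});
        }
        return out;
    }
}
```

Each character is visited by the inner scans at most once across the whole
loop, so the total work is linear in the text length, including the
single-very-long-word case.  Any per-token feature computation would go inside
the "if" block, so that it only runs when the token changes.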



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
