[ 
https://issues.apache.org/jira/browse/CTAKES-520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ewan Mellor updated CTAKES-520:
-------------------------------
    Description: 
SentenceDetectorAnnotatorBIO iterates over every character in the Segment and 
classifies it as Begin, Inside, or Outside a Sentence.  When doing this, it 
needs to know the next and previous token from the current character.

It currently finds these tokens afresh for each character.  That means that it 
starts from the current character, and scans forward and backwards looking for 
whitespace until it finds the boundaries of the tokens either side of the 
current position.  This is very wasteful; when the current index steps within a 
word, the tokens do not change since we're still within the same word.  Also, 
since we're scanning in one direction, we never need to scan for the previous 
token, because we already know it.
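
To make the cost concrete, here is a minimal sketch of the per-character scan described above.  It is not the cTAKES code: the class NaiveTokenScan, its methods, and the "inspected" counter are hypothetical names used only for illustration, and a "token" is simplified to the run of non-whitespace characters around a position.

```java
public class NaiveTokenScan {
    /** Number of scan steps taken; instrumentation for this sketch only. */
    static long inspected = 0;

    /**
     * Scans outwards from pos, backwards and then forwards, until whitespace
     * or the end of the text is reached. Returns {start, end} with end exclusive.
     */
    static int[] tokenAround(String text, int pos) {
        int start = pos;
        while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
            start--;
            inspected++;
        }
        int end = pos;
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
            end++;
            inspected++;
        }
        return new int[] {start, end};
    }

    /** Repeats the outward scan afresh for every character, as described above. */
    static long scanAll(String text) {
        inspected = 0;
        for (int i = 0; i < text.length(); i++) {
            tokenAround(text, i);
        }
        return inspected;
    }
}
```

Under these assumptions, a single word of n characters costs n scan steps at every position, so n * n steps in total.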

(I found this bug with a pathological case where I had a "document" with a 
single word that was a megabyte long.  In a case where the word length is not 
bounded, the current algorithm is quadratic instead of linear, because it scans 
the length of the word for each character.)

Patch attached.  This fixes the problem by keeping track of the word boundary, 
and only scanning for the next token when we have reached the boundary of the 
current one.  Also, the previous token is simply taken as the token from the 
previous iteration, and the token features are only recomputed when the token 
changes.
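
The approach in the paragraph above can be sketched roughly as follows.  This is an illustration under simplifying assumptions, not the attached patch: CachedTokenScan and its members are hypothetical names, the "features" are stood in for by a plain string, and a token is simply a run of non-whitespace characters.

```java
import java.util.ArrayList;
import java.util.List;

public class CachedTokenScan {
    /** Instrumentation for this sketch only. */
    static long inspected = 0;
    static int featureComputations = 0;

    /** Returns the end (exclusive) of the non-whitespace run beginning at start. */
    static int scanToBoundary(String text, int start) {
        int end = start;
        while (end < text.length() && !Character.isWhitespace(text.charAt(end))) {
            end++;
            inspected++;
        }
        return end;
    }

    /** Returns, for each character index, the "previous|current" token pair in effect. */
    static List<String> scanAll(String text) {
        inspected = 0;
        featureComputations = 0;
        List<String> context = new ArrayList<>();
        String previous = "";
        String current = "";
        String features = "";
        int boundary = 0; // end (exclusive) of the current token

        for (int i = 0; i < text.length(); i++) {
            // Only rescan once we have stepped past the current token
            // and reached the start of the next one.
            if (i >= boundary && !Character.isWhitespace(text.charAt(i))) {
                previous = current; // carried over from the last token, never rescanned
                boundary = scanToBoundary(text, i);
                current = text.substring(i, boundary);
                features = previous + "|" + current; // recomputed only on a token change
                featureComputations++;
            }
            context.add(features);
        }
        return context;
    }
}
```

In this sketch each character is scanned at most once to find a boundary, so the scanning work is linear in the length of the text, and the stand-in features are computed once per token rather than once per character.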

  was:
SentenceDetectorAnnotatorBIO iterates over every character in the Segment and 
classifies it as Begin, Inside, or Outside a Sentence.  When doing this, it 
needs to know the next and previous token from the current character.

It currently finds these tokens afresh for each character.  That means that it 
starts from the current character, and scans forward and backwards looking for 
whitespace until it finds the boundaries of the tokens either side of the 
current position.  This is very wasteful; when the current index steps within a 
word, the tokens do not change since we're still within the same word.  Also, 
since we're scanning in one direction, we never need to scan for the previous 
token, because we already know it.

(I found this bug with a pathological case where I had a "document" with a 
single word that was a megabyte long.  In a case where the word length is not 
bounded, the current algorithm is quadratic instead of linear, because it scans 
the length of the word for each character.)

Patch attached.  This fixes the problem by keeping track of the word boundary, 
and only scanning for the next token when we have reached the boundary of the 
current one.  Also, the previous token is simply taken as the token from the 
previous iteration.


> SentenceDetectorAnnotatorBIO token scanning performance issues
> --------------------------------------------------------------
>
>                 Key: CTAKES-520
>                 URL: https://issues.apache.org/jira/browse/CTAKES-520
>             Project: cTAKES
>          Issue Type: Improvement
>          Components: ctakes-core
>    Affects Versions: 4.0.0
>            Reporter: Ewan Mellor
>            Priority: Minor
>         Attachments: CTAKES-520.patch



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
