[
https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576676#comment-16576676
]
Uwe Schindler edited comment on LUCENE-8450 at 8/10/18 6:20 PM:
----------------------------------------------------------------
bq. Separately I don't like the correctOffset() method that we already have on
tokenizer today. maybe it could be in the offsetattributeimpl or similar
instead.
I think correctOffset should indeed be part of the OffsetAttribute (we need to
extend the interface). But we have to make sure that it does not contain any
hidden state. Attributes are only "beans" with getters and setters and no
hidden state, and must be symmetric (if you set something with a setter, the
getter must return it unmodified). They can be used as state on their own (like
FlagsAttribute) to control later filters, but they should not have any hidden
state that affects how the setters work.
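To make the "bean with no hidden state" constraint concrete, here is a minimal, self-contained sketch of what such an extended attribute might look like. All names (OffsetCorrector, setCorrector) are hypothetical illustrations, not the actual Lucene API; the point is that the correction function is set like any other property, so setter/getter round-trips stay symmetric and the attribute keeps no hidden mapping state of its own:

```java
// Hypothetical sketch -- NOT the real Lucene OffsetAttribute API.
// The corrector is just another bean property; the attribute stores
// no hidden offset-mapping state.
interface OffsetCorrector {
    int correct(int currentOff);
}

class OffsetAttributeImpl {
    private int startOffset;
    private int endOffset;
    // Identity by default; a Tokenizer/CharFilter chain could install its own.
    private OffsetCorrector corrector = off -> off;

    void setOffset(int start, int end) {
        this.startOffset = start;
        this.endOffset = end;
    }

    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    void setCorrector(OffsetCorrector corrector) { this.corrector = corrector; }

    // Delegates to the installed corrector; no state is consulted or mutated.
    int correctOffset(int currentOff) { return corrector.correct(currentOff); }
}
```

Note the symmetry: setOffset/startOffset/endOffset round-trip unmodified, and correctOffset is a pure delegation, so copying or clearing the attribute cannot silently change how later filters see offsets.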
bq. Maybe it makes sense for something like standardtokenizer to offer a
"decompound hook" or something that is very limited (e.g., not a chain, just
one thing) so that european language decompounders don't need to duplicate a
lot of the logic around punctuation and unicode
Actually that is the real solution for decompounding or WordDelimiterFilter.
Actually all tokenizers should support it. Maybe that can be done in the base
class, with incrementToken() made final. Instead the parsing code could push
tokens that are passed to the decompounder, and then incrementToken returns
them. So incrementToken is final, calls some next method on the tokenizer, and
passes the result to the decompounder, which is a no-op by default.
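The pattern described above can be sketched in a self-contained way. Class and method names here (DecompoundingTokenizerSketch, next, decompound) are assumptions for illustration, not the actual Lucene base class; tokens are simplified to plain strings:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the proposed base-class pattern: incrementToken()
// is final, subclasses implement a lower-level next(), and a no-op
// decompound() hook may split each raw token into several output tokens.
abstract class DecompoundingTokenizerSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private String current;

    public final boolean incrementToken() {
        while (pending.isEmpty()) {
            String raw = next();             // raw token from the parsing code
            if (raw == null) {
                return false;                // end of input
            }
            pending.addAll(decompound(raw)); // default: the token itself
        }
        current = pending.poll();
        return true;
    }

    public String currentToken() { return current; }

    /** Low-level tokenization step; returns null at end of input. */
    protected abstract String next();

    /** Decompound hook; a no-op by default. */
    protected List<String> decompound(String token) {
        return List.of(token);
    }
}
```

A decompounder then only overrides decompound(), without duplicating any of the punctuation/Unicode parsing logic, and incrementToken stays final in the base class.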
Another way would be to have a special type of TokenFilter where the input is
not a TokenStream but a Tokenizer (the constructor takes "Tokenizer" instead
of "TokenStream", and the "input" field is also a Tokenizer). In general
decompounders should always come directly after the tokenizer (some of them may
need to lowercase at the moment, like dictionary-based decompounders, but
that's a bug, IMHO). Those special TokenFilters "know" and can rely on the
Tokenizer and call correctOffset on it if they split tokens.
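A minimal sketch of that second idea, using stand-in classes rather than the real Lucene types (TokenizerStub and TokenizerBoundFilter are hypothetical names): the filter's input field is typed as the tokenizer itself, so a decompounder splitting a token can map interior offsets back through it:

```java
// Hypothetical sketch -- stand-ins for Lucene's Tokenizer/TokenFilter.
class TokenizerStub {
    // Simplified: corrects offsets back to the original input, e.g. a fixed
    // shift such as a CharFilter stripping a 2-char prefix might introduce.
    int correctOffset(int currentOff) { return currentOff + 2; }
}

class TokenizerBoundFilter {
    // Typed as the tokenizer, not a generic TokenStream, so the filter can
    // rely on its offset-correction contract.
    protected final TokenizerStub input;

    TokenizerBoundFilter(TokenizerStub input) { this.input = input; }

    /** Map an offset inside a split token back to the original input. */
    protected int correctOffset(int currentOff) {
        return input.correctOffset(currentOff);
    }
}
```

The design point is the narrowed type: an arbitrary TokenFilter chain cannot safely call correctOffset on its upstream, but a filter whose input is statically a Tokenizer can.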
> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
> Key: LUCENE-8450
> URL: https://issues.apache.org/jira/browse/LUCENE-8450
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Major
> Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent
> filters cannot perform simple arithmetic to calculate the original
> ("correct") offset of a character in the interior of the token. A similar
> situation exists for Tokenizers, but these can call
> CharFilter.correctOffset() to map offsets back to their original location in
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the
> correct portion of a compound token. For example the german word
> "außerstand" (meaning afaict "unable to do something") will be decompounded
> and match "stand" and "ausser", but as things are today, offsets are always
> set using the start and end of the tokens produced by Tokenizer, meaning that
> highlighters will match the entire compound.
> I'm proposing to add this method to `TokenStream`:
> {{ public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{ int correctOffset(int currentOff);}}
> {{ int uncorrectOffset(int originalOff);}}
>
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from
> original offset forward to the current "offset space".
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]