[
https://issues.apache.org/jira/browse/LUCENE-8450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576676#comment-16576676
]
Uwe Schindler edited comment on LUCENE-8450 at 8/10/18 6:20 PM:
----------------------------------------------------------------
bq. Separately I don't like the correctOffset() method that we already have on
tokenizer today. maybe it could be in the offsetattributeimpl or similar
instead.
I think correctOffset should indeed be part of the OffsetAttribute (we need to
extend the interface). But we have to make sure that it does not contain any
hidden state. Attributes are only "beans" with getters and setters and no
hidden state, and must be symmetric (if you set something with a setter, the
getter must return it unmodified). They can be used as state on their own (like
FlagsAttribute) to control later filters, but they should not have any hidden
state that affects how the setters work.
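To make the "bean with no hidden state" constraint concrete, here is a minimal, self-contained sketch of what such an extended attribute might look like. All names (OffsetCorrector, setCorrector) are hypothetical illustrations, not the actual Lucene API; the point is that the correction function is set like any other property, so setter/getter round-trips stay symmetric and the attribute keeps no hidden mapping state of its own:

```java
// Hypothetical sketch -- NOT the real Lucene OffsetAttribute API.
// The corrector is just another bean property; the attribute stores
// no hidden offset-mapping state.
interface OffsetCorrector {
    int correct(int currentOff);
}

class OffsetAttributeImpl {
    private int startOffset;
    private int endOffset;
    // Identity by default; a Tokenizer/CharFilter chain could install its own.
    private OffsetCorrector corrector = off -> off;

    void setOffset(int start, int end) {
        this.startOffset = start;
        this.endOffset = end;
    }

    int startOffset() { return startOffset; }
    int endOffset() { return endOffset; }

    void setCorrector(OffsetCorrector corrector) { this.corrector = corrector; }

    // Delegates to the installed corrector; no state is consulted or mutated.
    int correctOffset(int currentOff) { return corrector.correct(currentOff); }
}
```

Note the symmetry: setOffset/startOffset/endOffset round-trip unmodified, and correctOffset is a pure delegation, so copying or clearing the attribute cannot silently change how later filters see offsets.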
bq. Maybe it makes sense for something like standardtokenizer to offer a
"decompound hook" or something that is very limited (e.g., not a chain, just
one thing) so that european language decompounders don't need to duplicate a
lot of the logic around punctuation and unicode
Actually that is the real solution for decompounding or WordDelimiterFilter.
Actually all tokenizers should support it. Maybe that can be done in the base
class, with incrementToken() made final. Instead the parsing code could push
tokens that are passed to the decompounder, and then incrementToken returns
them. So incrementToken is final, calls some next method on the tokenizer, and
passes the result to the decompounder, which is a no-op by default.
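The pattern described above can be sketched in a self-contained way. Class and method names here (DecompoundingTokenizerSketch, next, decompound) are assumptions for illustration, not the actual Lucene base class; tokens are simplified to plain strings:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical sketch of the proposed base-class pattern: incrementToken()
// is final, subclasses implement a lower-level next(), and a no-op
// decompound() hook may split each raw token into several output tokens.
abstract class DecompoundingTokenizerSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private String current;

    public final boolean incrementToken() {
        while (pending.isEmpty()) {
            String raw = next();             // raw token from the parsing code
            if (raw == null) {
                return false;                // end of input
            }
            pending.addAll(decompound(raw)); // default: the token itself
        }
        current = pending.poll();
        return true;
    }

    public String currentToken() { return current; }

    /** Low-level tokenization step; returns null at end of input. */
    protected abstract String next();

    /** Decompound hook; a no-op by default. */
    protected List<String> decompound(String token) {
        return List.of(token);
    }
}
```

A decompounder then only overrides decompound(), without duplicating any of the punctuation/Unicode parsing logic, and incrementToken stays final in the base class.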
Another way would be to have a special type of TokenFilter where the input is
not a TokenStream but a Tokenizer (the constructor takes "Tokenizer" instead
of "TokenStream", and the "input" field is also a Tokenizer). In general
decompounders should always come directly after the tokenizer (some of them may
need to lowercase at the moment, like dictionary-based decompounders, but
that's a bug, IMHO). Those special TokenFilters "know" and can rely on the
Tokenizer and call correctOffset on it if they split tokens.
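A minimal sketch of that second idea, using stand-in classes rather than the real Lucene types (TokenizerStub and TokenizerBoundFilter are hypothetical names): the filter's input field is typed as the tokenizer itself, so a decompounder splitting a token can map interior offsets back through it:

```java
// Hypothetical sketch -- stand-ins for Lucene's Tokenizer/TokenFilter.
class TokenizerStub {
    // Simplified: corrects offsets back to the original input, e.g. a fixed
    // shift such as a CharFilter stripping a 2-char prefix might introduce.
    int correctOffset(int currentOff) { return currentOff + 2; }
}

class TokenizerBoundFilter {
    // Typed as the tokenizer, not a generic TokenStream, so the filter can
    // rely on its offset-correction contract.
    protected final TokenizerStub input;

    TokenizerBoundFilter(TokenizerStub input) { this.input = input; }

    /** Map an offset inside a split token back to the original input. */
    protected int correctOffset(int currentOff) {
        return input.correctOffset(currentOff);
    }
}
```

The design point is the narrowed type: an arbitrary TokenFilter chain cannot safely call correctOffset on its upstream, but a filter whose input is statically a Tokenizer can.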
> Enable TokenFilters to assign offsets when splitting tokens
> -----------------------------------------------------------
>
> Key: LUCENE-8450
> URL: https://issues.apache.org/jira/browse/LUCENE-8450
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Mike Sokolov
> Priority: Major
> Attachments: offsets.patch
>
>
> CharFilters and TokenFilters may alter token lengths, meaning that subsequent
> filters cannot perform simple arithmetic to calculate the original
> ("correct") offset of a character in the interior of the token. A similar
> situation exists for Tokenizers, but these can call
> CharFilter.correctOffset() to map offsets back to their original location in
> the input stream. There is no such API for TokenFilters.
> This issue calls for adding an API to support use cases like highlighting the
> correct portion of a compound token. For example the german word
> "außerstand" (meaning afaict "unable to do something") will be decompounded
> and match "stand" and "ausser", but as things are today, offsets are always
> set using the start and end of the tokens produced by Tokenizer, meaning that
> highlighters will match the entire compound.
> I'm proposing to add this method to `TokenStream`:
> {{ public CharOffsetMap getCharOffsetMap();}}
> referencing a CharOffsetMap with these methods:
> {{ int correctOffset(int currentOff);}}
> {{ int uncorrectOffset(int originalOff);}}
>
> The uncorrectOffset method is a pseudo-inverse of correctOffset, mapping from
> original offset forward to the current "offset space".
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]