[
https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421479#comment-16421479
]
Shawn Heisey edited comment on LUCENE-7960 at 3/31/18 9:28 PM:
---------------------------------------------------------------
When I created this issue, I didn't think about long terms. Somebody probably
needs the functionality.
On further reflection, I don't think that new parameter names should be plural.
Using "keepShortTerm" and "keepLongTerm" sounds better to me. They could both
be enabled if that's what the user wants. The same options should be added to
all ngram analysis components, not just EdgeNgramFilter.
Let's consider a min length of 4 and a max length of 6, using EdgeNgramFilter.
If the input term is "abcdefgh", here's the basic term list from the filter:
abcd abcde abcdef
If keepShortTerm is enabled with the longer input, there is no change. If
keepLongTerm is enabled with the longer input, then the term list will be:
abcd abcde abcdef abcdefgh
The seven-character prefix "abcdefg" would not be created. If that's what the
user wants, they should just increase the max value rather than enable the new
option.
If the input term is "ab", then the filter would not normally produce any
terms. With keepShortTerm, the output would be the input -- "ab".
The keepLongTerm option would have no effect with a short input.
I did glance at the patch, but didn't examine it in detail, so I don't know if
it does what I just described or not.
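The behavior described in this comment can be sketched in a few lines of
standalone Java. This is only an illustration of the proposed semantics, not
the patch and not Lucene's API: the class EdgeNGramSketch, the method
edgeNGrams, and its parameter names are all hypothetical, and lengths are
counted in Java chars rather than code points.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the proposed options; not Lucene code. */
final class EdgeNGramSketch {

    /**
     * Returns the edge n-grams (prefixes) of term with lengths in
     * [minGram, maxGram], applying the two proposed options.
     */
    static List<String> edgeNGrams(String term, int minGram, int maxGram,
                                   boolean keepShortTerm, boolean keepLongTerm) {
        List<String> out = new ArrayList<>();
        int len = term.length();

        // Input shorter than minGram: normally nothing is emitted.
        // keepShortTerm passes the input through unchanged instead.
        if (len < minGram) {
            if (keepShortTerm) {
                out.add(term);
            }
            return out;
        }

        // The basic term list: prefixes from minGram up to maxGram chars.
        int limit = Math.min(maxGram, len);
        for (int n = minGram; n <= limit; n++) {
            out.add(term.substring(0, n));
        }

        // Input longer than maxGram: keepLongTerm appends the whole input.
        // Intermediate lengths between maxGram and the full term are skipped.
        if (keepLongTerm && len > maxGram) {
            out.add(term);
        }
        return out;
    }

    public static void main(String[] args) {
        // min 4, max 6, as in the examples in this comment
        System.out.println(edgeNGrams("abcdefgh", 4, 6, false, false)); // [abcd, abcde, abcdef]
        System.out.println(edgeNGrams("abcdefgh", 4, 6, false, true));  // [abcd, abcde, abcdef, abcdefgh]
        System.out.println(edgeNGrams("ab", 4, 6, true, false));        // [ab]
        System.out.println(edgeNGrams("ab", 4, 6, false, true));        // []
    }
}
```

In this sketch the two flags are independent, so enabling both simply combines
the two special cases; neither one changes the output for an input whose
length falls between the min and max values.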
was (Author: elyograg):
When I created this issue, I didn't think about long terms. Somebody probably
needs the functionality.
On further reflection, I don't think that new parameter names should be plural.
Using "keepShortTerm" and "keepLongTerm" sounds better to me. They could both
be enabled if that's what the user wants. The same options should be added to
all ngram analysis components, not just EdgeNgramFilter.
Let's consider a min length of 4 and a max length of 6, using EdgeNgramFilter.
If the input term is "abcdefgh", here's the basic term list from the filter:
abcd abcde abcdef
If keepShortTerm is enabled with the longer input, there is no change. If
keepLongTerm is enabled with the longer input, then the term list will be:
abcd abcde abcdef abcdefgh
The seven-character string would not be created. If that's what the user
wants, they should just increase the max value, rather than enable the new
option.
If the input term is "ab", then the filter would not normally produce any
terms. With keepShortTerm, the output would be the input -- "ab". The
three-character term would not be produced. If the user wants that, they would
need to reduce the min value.
The keepLongTerm option would have no effect with a short input.
I did glance at the patch, but didn't examine it in detail, so I don't know if
it does what I just described or not.
> NGram filters -- add option to keep short terms
> -----------------------------------------------
>
> Key: LUCENE-7960
> URL: https://issues.apache.org/jira/browse/LUCENE-7960
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Shawn Heisey
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of
> problems for users. I am not suggesting that the default behavior be
> changed. That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like
> keepShortTerms, that defaults to false, to allow the short terms to be
> preserved.