[jira] [Comment Edited] (LUCENE-7960) NGram filters -- add option to keep short terms

Shawn Heisey (JIRA) Sat, 31 Mar 2018 14:29:40 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-7960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16421479#comment-16421479 ]


Shawn Heisey edited comment on LUCENE-7960 at 3/31/18 9:28 PM:
---------------------------------------------------------------

When I created this issue, I didn't think about terms longer than the max 
gram size.  Somebody probably needs that functionality too.

On further reflection, I don't think that new parameter names should be plural.  
Using "keepShortTerm" and "keepLongTerm" sounds better to me.  They could both 
be enabled if that's what the user wants.  The same options should be added to 
all ngram analysis components, not just EdgeNgramFilter.

Let's consider a min length of 4 and a max length of 6, using EdgeNgramFilter.

If the input term is "abcdefgh", here's the basic term list from the filter:

abcd abcde abcdef

If keepShortTerm is enabled with the longer input, there is no change.  If 
keepLongTerm is enabled with the longer input, then the term list will be:

abcd abcde abcdef abcdefgh

The seven-character string "abcdefg" would not be created.  If that's what the 
user wants, they should just increase the max value, rather than enable the new 
option.

If the input term is "ab", then the filter would not normally produce any 
terms.  With keepShortTerm, the output would be the input -- "ab".

The keepLongTerm option would have no effect with a short input.  
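
To make the cases above concrete, here is a minimal sketch of the behavior I'm 
describing.  It is plain Python, not Lucene code and not the patch: the 
function name edge_ngrams and its keep_short_term / keep_long_term arguments 
are made up for illustration and only model the proposed keepShortTerm / 
keepLongTerm options.

```python
def edge_ngrams(term, min_gram, max_gram,
                keep_short_term=False, keep_long_term=False):
    """Hypothetical model of the proposed options (not the Lucene API)."""
    if len(term) < min_gram:
        # Too short to produce any gram: normally dropped entirely,
        # passed through unchanged when keep_short_term is set.
        return [term] if keep_short_term else []

    # Leading-edge grams from min_gram up to max_gram (or the term length).
    longest = min(max_gram, len(term))
    grams = [term[:n] for n in range(min_gram, longest + 1)]

    if keep_long_term and len(term) > max_gram:
        # Append the full original term; lengths between max_gram and
        # the full length (e.g. "abcdefg") are still not generated.
        grams.append(term)
    return grams


print(edge_ngrams("abcdefgh", 4, 6))
# ['abcd', 'abcde', 'abcdef']
print(edge_ngrams("abcdefgh", 4, 6, keep_long_term=True))
# ['abcd', 'abcde', 'abcdef', 'abcdefgh']
print(edge_ngrams("ab", 4, 6, keep_short_term=True))
# ['ab']
```

In this sketch the two flags are independent, so enabling both simply covers 
both the too-short and the too-long case.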

I did glance at the patch, but didn't examine it in detail, so I don't know if 
it does what I just described or not.


was (Author: elyograg):
When I created this issue, I didn't think about long terms.  Somebody probably 
needs the functionality.

On further reflection, I don't think that new parameter names should be plural.  
Using "keepShortTerm" and "keepLongTerm" sounds better to me.  They could both 
be enabled if that's what the user wants.  The same options should be added to 
all ngram analysis components, not just EdgeNgramFilter.

Let's consider a min length of 4 and a max length of 6, using EdgeNgramFilter.

If the input term is "abcdefgh", here's the basic term list from the filter:

abcd abcde abcdef

If keepShortTerm is enabled with the longer input, there is no change.  If 
keepLongTerm is enabled with the longer input, then the term list will be:

abcd abcde abcdef abcdefgh

The seven-character string would not be created.  If that's what the user 
wants, they should just increase the max value, rather than enable the new 
option.

If the input term is "ab", then the filter would not normally produce any 
terms.  With keepShortTerm, the output would be the input -- "ab".  The 
three-character term would not be produced.  If the user wants that, they would 
need to reduce the min value.

The keepLongTerm option would have no effect with a short input.  

I did glance at the patch, but didn't examine it in detail, so I don't know if 
it does what I just described or not.

> NGram filters -- add option to keep short terms
> -----------------------------------------------
>
>                 Key: LUCENE-7960
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7960
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Shawn Heisey
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> When ngram or edgengram filters are used, any terms that are shorter than the 
> minGramSize are completely removed from the token stream.
> This is probably 100% what was intended, but I've seen it cause a lot of 
> problems for users.  I am not suggesting that the default behavior be 
> changed.  That would be far too disruptive to the existing user base.
> I do think there should be a new boolean option, with a name like 
> keepShortTerms, that defaults to false, to allow the short terms to be 
> preserved.



