[jira] [Commented] (LUCENE-5674) A new token filter: SubSequence

Nitzan Shaked (JIRA) Sun, 25 May 2014 21:28:18 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-5674?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14008566#comment-14008566 ]


Nitzan Shaked commented on LUCENE-5674:
---------------------------------------

Ahmet:

1) I'll attach a "squashed" version of the patch, without history; hopefully 
that'll be easier to read.
2) I don't know how to "prove" that something can't be done using existing 
analysis components, but after spending quite some time on this, and after 
asking on S.O., I am fairly convinced that it indeed cannot be done using 
existing components.
3) Instantiating with minLen>maxLen is OK, since maxLen can be negative (-2 to 
count 2 sub-tokens from the end, for example). minLen may also be greater than 
the length (in sub-parts) of some tokens. In those cases there will simply be 
no output for the given token. I'll add a check that when both minLen and 
maxLen are positive, minLen <= maxLen.
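
The check described in (3) might look roughly like the sketch below. The 
class and method names are hypothetical and are not taken from the patch; 
the only rule encoded is the one stated above (reject minLen > maxLen only 
when both are positive).

```java
// Hypothetical helper illustrating the argument check; not code from the patch.
public class SubSequenceArgs {

    /**
     * Rejects minLen > maxLen only when both values are positive.
     * maxLen == 0 (unlimited) and negative maxLen (relative to the number
     * of sub-parts in each token) cannot be compared up front, so they pass.
     */
    public static void validate(int minLen, int maxLen) {
        if (minLen > 0 && maxLen > 0 && minLen > maxLen) {
            throw new IllegalArgumentException(
                "minLen (" + minLen + ") must be <= maxLen (" + maxLen
                    + ") when both are positive");
        }
    }
}
```

With this rule, validate(3, -2) and validate(2, 0) are accepted, while 
validate(3, 2) throws.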

Otis: while I'm adding this last check, I'll also add the "reverse" option; I 
can see why that might be useful.

> A new token filter: SubSequence
> -------------------------------
>
>                 Key: LUCENE-5674
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5674
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/other
>            Reporter: Nitzan Shaked
>            Priority: Minor
>         Attachments: subseqfilter.patch
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> A new configurable token filter which, given a token, breaks it into sub-parts 
> and outputs consecutive sub-sequences of those sub-parts.
> Useful during indexing, for example, to generate variations on domain names, 
> so that "www.google.com" can be found by searching for "google.com" or 
> "www.google.com".
> Parameters:
> sepRegexp: A regular expression used to split incoming tokens into sub-parts.
> glue: A string used to concatenate sub-parts together when creating 
> sub-sequences.
> minLen: Minimum length (in sub-parts) of output sub-sequences
> maxLen: Maximum length (in sub-parts) of output sub-sequences (0 for 
> unlimited; negative numbers for token length in sub-parts minus specified 
> length)
> anchor: Anchor.START to output only prefixes, or Anchor.END to output only 
> suffixes, or Anchor.NONE to output any sub-sequence
> withOriginal: whether to also output the original token
> EDIT: now includes tests for filter and for factory.
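
The parameters above can be illustrated with a minimal stand-alone sketch in 
plain Java. It does not use the Lucene TokenFilter API and is not the code 
from subseqfilter.patch; the class name, method name, and output order are 
assumptions made for illustration. It also assumes that withOriginal does not 
emit the original token a second time when a generated sub-sequence already 
equals it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-alone sketch of the described behaviour; not the patch itself.
public class SubSequenceSketch {

    public enum Anchor { START, END, NONE }

    public static List<String> subSequences(String token, String sepRegexp, String glue,
                                            int minLen, int maxLen,
                                            Anchor anchor, boolean withOriginal) {
        // Split the token into sub-parts using the separator regexp.
        List<String> parts = Arrays.asList(Pattern.compile(sepRegexp).split(token));
        int n = parts.size();

        // maxLen > 0: absolute cap; 0: unlimited; negative: n minus |maxLen|.
        int effMax = maxLen > 0 ? Math.min(maxLen, n) : n + maxLen;
        int effMin = Math.max(minLen, 1);

        List<String> out = new ArrayList<>();
        // If effMin > effMax the loop body never runs: no output for this token.
        for (int len = effMin; len <= effMax; len++) {
            for (int start = 0; start + len <= n; start++) {
                if (anchor == Anchor.START && start != 0) continue;      // prefixes only
                if (anchor == Anchor.END && start + len != n) continue;  // suffixes only
                out.add(String.join(glue, parts.subList(start, start + len)));
            }
        }
        // Assumption: avoid emitting the original twice if it was already generated.
        if (withOriginal && !out.contains(token)) {
            out.add(token);
        }
        return out;
    }

    public static void main(String[] args) {
        // Suffixes of at least two sub-parts, as in the domain-name example.
        System.out.println(
            subSequences("www.google.com", "\\.", ".", 2, 0, Anchor.END, false));
        // prints [google.com, www.google.com]
    }
}
```

In this sketch a token with fewer sub-parts than minLen produces no output, 
which matches the behaviour described in the comment for minLen greater than 
a token's length.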


