Add conditional branching/merging to Lucene's analysis pipeline
---------------------------------------------------------------

                 Key: LUCENE-2470
                 URL: https://issues.apache.org/jira/browse/LUCENE-2470
             Project: Lucene - Java
          Issue Type: New Feature
          Components: Analysis
    Affects Versions: 4.0
            Reporter: Steven Rowe
            Priority: Minor


Captured from a #lucene brainstorming session with Robert Muir:

Lucene's analysis pipeline would be more flexible if it were possible to apply 
filter(s) to only part of an input stream's tokens, under user-specifiable 
conditions (e.g. when a given token attribute has a particular value) in a way 
that did not place that responsibility on individual filters.

Two use cases:

# StandardAnalyzer could directly handle ideographic characters in the same way 
as CJKTokenizer, which generates bigrams, if it could apply ShingleFilter only 
when TypeAttribute=<CJK>, or when Robert's new ScriptAttribute=<Ideographic>.
# Stemming might make sense for some stemmer/domain combinations only when 
token length exceeds some threshold.  For example, a user could configure an 
analyzer to stem only when CharTermAttribute length is greater than 4 
characters.
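
To make the second use case concrete, here is a minimal sketch in plain Java. 
It is not Lucene code: ConditionalStemSketch, applyWhen and toyStem are 
hypothetical names, terms are modelled as bare strings rather than 
CharTermAttribute values, and the "stemmer" only strips a trailing "s".  The 
point is just that the length condition lives outside the stemming step.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

// Hypothetical stand-in for the proposed facility; not a Lucene API.
final class ConditionalStemSketch {

    // Toy "stemmer" (assumption): strips one trailing "s" and nothing else.
    static String toyStem(String term) {
        return term.endsWith("s") ? term.substring(0, term.length() - 1) : term;
    }

    // Applies `filter` only to terms that satisfy `condition`;
    // every other term passes through unchanged.
    static List<String> applyWhen(List<String> terms,
                                  Predicate<String> condition,
                                  UnaryOperator<String> filter) {
        List<String> out = new ArrayList<>();
        for (String term : terms) {
            out.add(condition.test(term) ? filter.apply(term) : term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("bus", "cars", "trains", "engines");
        // Stem only when the term is longer than 4 characters.
        System.out.println(
            applyWhen(terms, t -> t.length() > 4, ConditionalStemSketch::toyStem));
        // prints [bus, cars, train, engine]
    }
}
```

Because the threshold is a predicate handed in by the caller, the toy stemmer 
never sees "bus" or "cars" and needs no length check of its own.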

One potential way to achieve this conditional branching facility is with a new 
kind of filter that can be configured with one or more following filters and 
the condition(s) under which those filters should be engaged.  This could be 
called BranchingFilter.
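
One way such a BranchingFilter could behave is sketched below in plain Java. 
This is an illustration under stated assumptions, not a proposed 
implementation: BranchingSketch, Token, route and bigrams are hypothetical 
names, no Lucene classes are used, bigrams is a toy stand-in for 
ShingleFilter, and the terms "c1", "c2", "c3" are ASCII placeholders for 
ideographic characters.  It also assumes the branch is applied to each run of 
consecutive matching tokens, which the proposal above does not specify.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical model of the proposed BranchingFilter; none of this is Lucene API.
final class BranchingSketch {

    // Minimal token: a term plus a type label, loosely mirroring TypeAttribute.
    static final class Token {
        final String term;
        final String type;
        Token(String term, String type) { this.term = term; this.type = type; }
    }

    // Sends each run of consecutive tokens matching `condition` through
    // `branchFilters`; all other tokens bypass the branch. Both paths are
    // written to the same output list, so input order is preserved.
    static List<Token> route(List<Token> in, Predicate<Token> condition,
                             Function<List<Token>, List<Token>> branchFilters) {
        List<Token> out = new ArrayList<>();
        List<Token> run = new ArrayList<>();
        for (Token t : in) {
            if (condition.test(t)) {
                run.add(t);
            } else {
                if (!run.isEmpty()) {
                    out.addAll(branchFilters.apply(run));
                    run = new ArrayList<>();
                }
                out.add(t);
            }
        }
        if (!run.isEmpty()) {
            out.addAll(branchFilters.apply(run));
        }
        return out;
    }

    // Toy stand-in for ShingleFilter (assumption): joins adjacent terms into
    // bigrams; a run of fewer than two tokens is returned unchanged.
    static List<Token> bigrams(List<Token> run) {
        if (run.size() < 2) {
            return run;
        }
        List<Token> out = new ArrayList<>();
        for (int i = 0; i + 1 < run.size(); i++) {
            out.add(new Token(run.get(i).term + run.get(i + 1).term, run.get(i).type));
        }
        return out;
    }

    // Helper for display: extracts the term text of each token.
    static List<String> terms(List<Token> tokens) {
        List<String> out = new ArrayList<>();
        for (Token t : tokens) {
            out.add(t.term);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> in = List.of(
            new Token("lucene", "<ALPHANUM>"),
            new Token("c1", "<CJK>"),
            new Token("c2", "<CJK>"),
            new Token("c3", "<CJK>"),
            new Token("search", "<ALPHANUM>"));
        List<Token> out = route(in, t -> t.type.equals("<CJK>"), BranchingSketch::bigrams);
        System.out.println(terms(out)); // prints [lucene, c1c2, c2c3, search]
    }
}
```

In this sketch the merge back into a single stream is implicit, because the 
branched and bypassed tokens land in one output list.  A pipeline built from 
separate filter objects would need an explicit join point instead.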

I think a MergingFilter, the inverse of BranchingFilter, is necessary in the 
current pipeline architecture, to have a single pipeline endpoint.  A 
MergingFilter might be useful in its own right, e.g. to collect document data 
from multiple sources.  Perhaps a conditional merging facility would be useful 
as well.
