[
https://issues.apache.org/jira/browse/LUCENE-2470?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12869322#action_12869322
]
Michael McCandless commented on LUCENE-2470:
--------------------------------------------
bq. I think one consequence of this design is that the BranchingFilter/Stage
would have to do its own merging, so MergingFilter is not necessary, right?
Right.
bq. The other uses for a MergingFilter should be put into another issue, if we
go with this design and there is interest, switching this issue to cover only
BranchingFilter/Stage.
These are interesting too!
bq. Do you mean that it should be possible to configure multiple filters to
process the same input token?
Actually I didn't -- I meant that we should allow a sub-pipeline to process 1
token and produce (say) 3. But it is a neat idea to allow more than one
sub-pipeline to operate on the same token; I like the PassThroughFilter.
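To make the 1-token-in, several-tokens-out idea concrete, here is a rough sketch in plain Java with no Lucene classes. The names BranchingSketch, branch and splitCompound are hypothetical, and the hyphen condition is only an assumed example. Tokens that do not match the condition pass through unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical stand-in for the proposed branching stage; not a Lucene API.
public class BranchingSketch {

    // Routes each token through subPipeline when condition holds;
    // every other token is passed through untouched.
    static List<String> branch(List<String> tokens,
                               Predicate<String> condition,
                               Function<String, List<String>> subPipeline) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (condition.test(token)) {
                out.addAll(subPipeline.apply(token)); // 1 token in, N tokens out
            } else {
                out.add(token);                       // pass-through
            }
        }
        return out;
    }

    // Assumed example sub-pipeline: emits the hyphen-separated parts of a
    // token, followed by the parts joined without hyphens.
    static List<String> splitCompound(String token) {
        List<String> out = new ArrayList<>();
        for (String part : token.split("-")) {
            if (!part.isEmpty()) {
                out.add(part);
            }
        }
        String joined = String.join("", out);
        if (out.size() > 1) {
            out.add(joined);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of("lucene", "wi-fi", "search");
        List<String> output = branch(input, t -> t.contains("-"),
                                     BranchingSketch::splitCompound);
        System.out.println(output); // prints [lucene, wi, fi, wifi, search]
    }
}
```

Because the branching stage appends the sub-pipeline's output straight into its own output, no separate merging step is needed in this sketch.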
bq. Before I forget: It's always bugged me that analysis output can only be to
a single field. Could this be the place to fix that?
That's a biggish change :) I think we should tackle it separately -- we'd have
to change the indexer for this (right now it visits one field at a time,
processing all of its tokens).
But, I do think this write-once attribute approach could be used as a document
pre-processing pipeline, e.g. to enhance the doc, pull out additional fields, etc.
> Add conditional branching/merging to Lucene's analysis pipeline
> ---------------------------------------------------------------
>
> Key: LUCENE-2470
> URL: https://issues.apache.org/jira/browse/LUCENE-2470
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Analysis
> Affects Versions: 4.0
> Reporter: Steven Rowe
> Priority: Minor
>
> Captured from a #lucene brainstorming session with Robert Muir:
> Lucene's analysis pipeline would be more flexible if it were possible to
> apply filter(s) to only part of an input stream's tokens, under
> user-specifiable conditions (e.g. when a given token attribute has a
> particular value) in a way that did not place that responsibility on
> individual filters.
> Two use cases:
> # StandardAnalyzer could directly handle ideographic characters in the same
> way as CJKTokenizer, which generates bigrams, if it could call ShingleFilter
> only when TypeAttribute=<CJK>, or when Robert's new
> ScriptAttribute=<Ideographic>.
> # Stemming might make sense for some stemmer/domain combinations only when
> token length exceeds some threshold. For example, a user could configure an
> analyzer to stem only when CharTermAttribute length is greater than 4
> characters.
> One potential way to achieve this conditional branching facility is with a
> new kind of filter that can be configured with one or more following filters
> and condition(s) under which the filter should be engaged. This could be
> called BranchingFilter.
> I think a MergingFilter, the inverse of BranchingFilter, is necessary in the
> current pipeline architecture, to have a single pipeline endpoint. A
> MergingFilter might be useful in its own right, e.g. to collect document data
> from multiple sources. Perhaps a conditional merging facility would be
> useful as well.