On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe <sar...@syr.edu> wrote:
> A thought: one way to do #1 without modifying ShingleFilter: if there were a 
> StopFilter variant that accepted regular expressions instead of a stopword 
> list, you could configure it with a regex like /_ .*|.* _| _ / (assuming a 
> full match is required, i.e. implicit beginning and end anchors), and place 
> it in the analysis pipeline after ShingleFilter to throw out shingles with 
> filler tokens in them.
>
> (I think it would be useful to generalize StopFilter to allow for more 
> sources of stoppage, rather than just creating a StopRegexFilter with no 
> relation to StopFilter.)
>

We already did this in 3.1 by adding a base FilteringTokenFilter class.
A regex filter is trivial if you subclass it (we could add something
like this untested code to the .pattern package or wherever):

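import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Package locations as of Lucene 3.1.
import org.apache.lucene.analysis.FilteringTokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Drops every token whose entire term text matches the pattern (untested). */
public class PatternRemoveFilter extends FilteringTokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  // Bound once to termAtt (a CharSequence); reset() picks up the current term.
  private final Matcher matcher;

  public PatternRemoveFilter(boolean enablePositionIncrements, TokenStream input, Pattern pattern) {
    super(enablePositionIncrements, input);
    matcher = pattern.matcher(termAtt);
  }

  @Override
  protected boolean accept() throws IOException {
    matcher.reset();            // re-read the term text for the current token
    return !matcher.matches();  // keep the token only if it does not fully match
  }
}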
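
To see the matching logic on its own, here is a minimal plain-Java sketch with no Lucene dependency. It applies the regex from Steven's mail to a few shingles and reuses one Matcher over a mutable CharSequence, the same trick the filter above plays with termAtt. The class name ShingleRegexDemo, the keep() helper, and the sample shingles are made up for illustration; "_" is assumed to be the filler token.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical stand-alone demo; not part of Lucene.
final class ShingleRegexDemo {
    // Regex from the quoted mail; matches() supplies the implicit anchors.
    static final Pattern FILLER = Pattern.compile("_ .*|.* _| _ ");

    // Returns the shingles that do NOT fully match the filler pattern.
    static List<String> keep(List<String> shingles) {
        StringBuilder term = new StringBuilder();   // stands in for termAtt
        Matcher matcher = FILLER.matcher(term);     // created once, reused
        List<String> kept = new ArrayList<>();
        for (String shingle : shingles) {
            term.setLength(0);
            term.append(shingle);
            matcher.reset();                        // re-reads the buffer's current contents
            if (!matcher.matches()) {
                kept.add(shingle);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> shingles = Arrays.asList(
            "please divide", "divide _", "_ sentence", "sentence into");
        System.out.println(keep(shingles)); // prints [please divide, sentence into]
    }
}
```

One thing this makes visible: with full-match semantics, a trigram with the filler in the middle, such as "a _ b", is not matched by the pattern as written, so something like ".* _ .*" would be needed if those should go too.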
