On Thu, May 12, 2011 at 1:03 PM, Steven A Rowe <sar...@syr.edu> wrote:
> A thought: one way to do #1 without modifying ShingleFilter: if there were a
> StopFilter variant that accepted regular expressions instead of a stopword
> list, you could configure it with a regex like /_ .*|.* _| _ / (assuming a
> full match is required, i.e. implicit beginning and end anchors), and place
> it in the analysis pipeline after ShingleFilter to throw out shingles with
> filler tokens in them.
>
> (I think it would be useful to generalize StopFilter to allow for more
> sources of stoppage, rather than just creating a StopRegexFilter with no
> relation to StopFilter.)
>
We already did this in 3.1 by adding a base FilteringTokenFilter class. A
regex filter is trivial if you subclass it (we could add something like this
untested code to the .pattern package or wherever):

    import java.io.IOException;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Package names as of Lucene 3.1.
    import org.apache.lucene.analysis.FilteringTokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

    public class PatternRemoveFilter extends FilteringTokenFilter {
      private final Matcher matcher;
      private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

      public PatternRemoveFilter(boolean enablePositionIncrements, TokenStream input, Pattern pattern) {
        super(enablePositionIncrements, input);
        // termAtt is a CharSequence, so one Matcher can be reused for every token.
        matcher = pattern.matcher(termAtt);
      }

      @Override
      protected boolean accept() throws IOException {
        // Re-read the current term, then drop the token on a full match.
        matcher.reset();
        return !matcher.matches();
      }
    }
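To sanity-check the full-match assumption without Lucene on the classpath, here is a minimal plain-Java sketch. The class name ShingleRegexSketch, the keep() helper, and the sample shingles are all made up for illustration; only the first pattern comes from the quoted message. It reuses one Matcher over a mutable buffer, the same way accept() reuses it over termAtt. It also shows that, with a full match required, the third alternative ` _ ` only matches the literal three-character string, so a shingle with a filler in the middle survives. The WIDENED pattern is my own variant, not something from the thread.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ShingleRegexSketch {
    // The regex from the quoted message, used with full-match semantics.
    static final Pattern QUOTED = Pattern.compile("_ .*|.* _| _ ");
    // Hypothetical variant that also catches a filler in the middle of a shingle.
    static final Pattern WIDENED = Pattern.compile("_ .*|.* _|.* _ .*");

    // Returns the terms that the pattern does NOT fully match.
    static List<String> keep(Pattern pattern, List<String> terms) {
        StringBuilder buffer = new StringBuilder();   // stands in for termAtt
        Matcher matcher = pattern.matcher(buffer);    // created once, reused per term
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            buffer.setLength(0);
            buffer.append(term);
            matcher.reset();                          // pick up the buffer's new contents
            if (!matcher.matches()) {                 // matches() requires a full match
                kept.add(term);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> shingles =
            List.of("please divide", "_ divide", "divide _", "a _ b", "under_score");
        System.out.println(keep(QUOTED, shingles));   // [please divide, a _ b, under_score]
        System.out.println(keep(WIDENED, shingles));  // [please divide, under_score]
    }
}
```

If trigrams or longer shingles are in play, something like the widened pattern is probably what you want; for bigrams only, the quoted regex already covers both filler positions.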