Hi Igal,

You didn't say you were using StandardTokenizer, but assuming you are: right now StandardTokenizer throws away punctuation, so no downstream filter will ever see it.
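Since the commas have to survive tokenization for any comma-based filtering to work, it may help to see the rule on whitespace-split tokens, where each comma stays attached to its word. What follows is only a rough plain-Java sketch of that logic, not Lucene code: the class name `CommaShingleSketch` and the method `shingles` are made up for illustration, and it works on strings rather than a TokenStream. It builds shingles, drops any shingle in which a token other than the last carries a comma, and then strips the remaining commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only: plain strings, no Lucene classes involved.
class CommaShingleSketch {

    // Builds word shingles of the given size from whitespace-separated tokens,
    // skipping any shingle in which a non-final token contains a comma,
    // then strips commas from the shingles that are kept.
    static List<String> shingles(String text, int size) {
        if (size < 1) {
            throw new IllegalArgumentException("size must be at least 1");
        }
        List<String> result = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        // Splitting on runs of whitespace keeps "apples," as a single token.
        String[] tokens = trimmed.split("\\s+");
        for (int start = 0; start + size <= tokens.length; start++) {
            boolean crossesComma = false;
            // Check every token except the last one in this shingle.
            for (int i = start; i < start + size - 1; i++) {
                if (tokens[i].contains(",")) {
                    crossesComma = true;
                    break;
                }
            }
            if (crossesComma) {
                continue;
            }
            String shingle = String.join(" ", Arrays.copyOfRange(tokens, start, start + size));
            // Second step: strip commas from whatever survived.
            result.add(shingle.replace(",", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [delicious red, red apples, green pears, and oranges]
        System.out.println(shingles("delicious red apples, green pears, and oranges", 2));
    }
}
```

In a real analysis chain the comma check would presumably sit in the accept() method of a filter placed after the shingle filter, with the comma stripping done by a later filter; the sketch only shows the predicate itself.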
If StandardTokenizer were modified to also output currently non-tokenized punctuation as tokens, then you could use a FilteringTokenFilter that removes any shingle containing a comma. See [1] and [2] for previous discussions on this topic.

For right now, if you use something like WhitespaceTokenizer, you could have a FilteringTokenFilter remove shingles with non-final-token commas (a comma on any token but the last), and then another filter that strips commas everywhere.

Steve

[1] Mike McCandless's post on LUCENE-3940 <https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>

[2] dev@l.a.o thread "Improve OOTB behavior: English word-splitting should default to autoGeneratePhraseQueries=true" <http://markmail.org/message/ewza54azui6knqwf>

On Nov 1, 2012, at 3:44 PM, Igal @ getRailo.org <i...@getrailo.org> wrote:

> hi,
>
> I'm trying to migrate to Lucene 4.
>
> in Lucene 3.5 I extended org.apache.lucene.analysis.FilteringTokenFilter and
> overrode accept() to remove undesired shingles. in Lucene 4
> org.apache.lucene.analysis.FilteringTokenFilter does not exist?
>
> I'm trying to achieve two things:
>
> 1) remove shingles that have an empty item.
>
> 2) remove shingles when the phrase contains a comma, for example:
>
> for the phrase: "delicious red apples, green pears, and oranges"
>
> I want the following shingles (with a shingle size of 2):
>
> "delicious red", "red apples", "green pears", "and oranges"
> (no "apples green" because there's a comma)
> (no "pears and" because there's a comma)
>
> any ideas?
>
> TIA
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org