hi Steve,

you are correct. I am using StandardTokenizer. I will look into the WhitespaceTokenizer and hopefully figure it out.

thank you,


Igal


On 11/1/2012 1:24 PM, Steve Rowe wrote:
Hi Igal,

You didn't say you were using StandardTokenizer, but assuming you are, right 
now StandardTokenizer throws away punctuation, so no following filters will see 
them.

If StandardTokenizer were modified to also output currently non-tokenized 
punctuation as tokens, then you could use a FilteringTokenFilter that removes 
any shingle containing commas.  See [1] and [2] for previous discussions on 
this topic.

For right now, if you use something like WhitespaceTokenizer, you could have a 
FilteringTokenFilter to remove shingles with non-final-token commas, and then 
another filter that strips commas everywhere.
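
The two-step idea above (drop shingles whose non-final token carries a comma, then strip commas from what is left) can be sketched in plain Java. This is only an illustration of the logic, not Lucene code: it calls no Lucene API, and the class name CommaAwareShingles and the method shingles() are hypothetical. It assumes whitespace-only splitting (as WhitespaceTokenizer would do) and a fixed shingle size of 2.

```java
import java.util.ArrayList;
import java.util.List;

public class CommaAwareShingles {

    /** Builds 2-token shingles and drops any shingle that spans a comma. */
    static List<String> shingles(String text) {
        // Split on whitespace only, so "apples," keeps its trailing comma.
        String[] tokens = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();

        for (int i = 0; i + 1 < tokens.length; i++) {
            String first = tokens[i];
            String second = tokens[i + 1];

            // A comma on the non-final token means the shingle crosses
            // a comma boundary, so reject it.
            if (first.contains(",")) {
                continue;
            }

            // Strip any remaining commas (e.g. on the final token).
            result.add((first + " " + second).replace(",", ""));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(
            shingles("delicious red apples, green pears, and oranges"));
    }
}
```

In a real analysis chain the same two checks would live in separate token filters placed after the shingle filter, as described above.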

Steve

[1] Mike McCandless's post on LUCENE-3940 
<https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>

[2] dev@l.a.o thread "Improve OOTB behavior: English word-splitting should default to 
autoGeneratePhraseQueries=true" <http://markmail.org/message/ewza54azui6knqwf>

On Nov 1, 2012, at 3:44 PM, Igal @ getRailo.org <i...@getrailo.org> wrote:

hi,

I'm trying to migrate to Lucene 4.

In Lucene 3.5 I extended org.apache.lucene.analysis.FilteringTokenFilter and 
overrode accept() to remove undesired shingles.  In Lucene 4, 
org.apache.lucene.analysis.FilteringTokenFilter does not exist?

I'm trying to achieve two things:

1) remove shingles that have an empty item.

2) remove shingles when the phrase contains a comma, for example:

    for the phrase:    "delicious red apples, green pears, and oranges"

I want the following shingles (with a shingle size of 2):

"delicious red", "red apples", "green pears", "and oranges"
(no "apples green" because there's a comma)
(no "pears and" because there's a comma)

any ideas?

TIA

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


