hi Steve,
you are correct. I am using StandardTokenizer. I will look into the
WhitespaceTokenizer and hopefully figure it out.
thank you,
Igal
On 11/1/2012 1:24 PM, Steve Rowe wrote:
Hi Igal,
You didn't say you were using StandardTokenizer, but assuming you are, right
now StandardTokenizer throws away punctuation, so no following filters will see
it.
If StandardTokenizer were modified to also output currently non-tokenized
punctuation as tokens, then you could use a FilteringTokenFilter that removes
any shingle containing commas. See [1] and [2] for previous discussions on
this topic.
For right now, if you use something like WhitespaceTokenizer, you could have a
FilteringTokenFilter to remove shingles with non-final-token commas, and then
another filter that strips commas everywhere.
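
To make the two-step idea above concrete, here is a minimal sketch in plain Java with no Lucene dependency. It only illustrates the logic: split on whitespace so commas stay attached to their words, drop any shingle in which a non-final token carries a comma, then strip the commas from what remains. The class name CommaAwareShingles and its methods are hypothetical, made up for this illustration; they are not Lucene classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: illustrates the filtering logic only.
class CommaAwareShingles {

    // Builds shingles of exactly `size` tokens from whitespace-separated text.
    static List<String> shingles(String text, int size) {
        // Whitespace splitting keeps trailing commas attached to tokens,
        // which is what a whitespace-based tokenizer would also do.
        String[] tokens = text.trim().split("\\s+");
        List<String> result = new ArrayList<>();
        for (int i = 0; i + size <= tokens.length; i++) {
            if (hasNonFinalComma(tokens, i, size)) {
                continue; // a comma sits inside the shingle, so drop it
            }
            // Step two: strip commas from the tokens of the surviving shingle.
            StringBuilder sb = new StringBuilder();
            for (int j = i; j < i + size; j++) {
                if (j > i) {
                    sb.append(' ');
                }
                sb.append(tokens[j].replace(",", ""));
            }
            result.add(sb.toString());
        }
        return result;
    }

    // True if any token other than the last one in the window contains a comma.
    static boolean hasNonFinalComma(String[] tokens, int start, int size) {
        for (int j = start; j < start + size - 1; j++) {
            if (tokens[j].contains(",")) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(
            shingles("delicious red apples, green pears, and oranges", 2));
    }
}
```

In a real analysis chain the same check would presumably go into accept() of a FilteringTokenFilter subclass placed after the shingle filter, with a separate filter doing the comma stripping. The sketch makes no attempt to mirror the TokenStream API.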
Steve
[1] Mike McCandless's post on LUCENE-3940
<https://issues.apache.org/jira/browse/LUCENE-3940?focusedCommentId=13243299&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13243299>
[2] dev@l.a.o thread "Improve OOTB behavior: English word-splitting should default to
autoGeneratePhraseQueries=true" <http://markmail.org/message/ewza54azui6knqwf>
On Nov 1, 2012, at 3:44 PM, Igal @ getRailo.org <i...@getrailo.org> wrote:
hi,
I'm trying to migrate to Lucene 4.
In Lucene 3.5 I extended org.apache.lucene.analysis.FilteringTokenFilter and
overrode accept() to remove undesired shingles. In Lucene 4,
org.apache.lucene.analysis.FilteringTokenFilter does not seem to exist?
I'm trying to achieve two things:
1) remove shingles that have an empty item.
2) remove shingles when the phrase contains a comma, for example:
for the phrase: "delicious red apples, green pears, and oranges"
I want the following shingles (with a shingle size of 2):
"delicious red", "red apples", "green pears", "and oranges"
(no "apples green" because there's a comma)
(no "pears and" because there's a comma)
any ideas?
TIA
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org