Hey Ahmet,
Yeah, I had missed Shawn's response; I'll have to give that a try as well. As
for the version, we're using 4.4. StandardTokenizer sets the type attribute for
HANGUL, HIRAGANA, IDEOGRAPHIC, KATAKANA, and SOUTHEAST_ASIAN, and you're right,
we're using TypeTokenFilter to remove those.
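In case the details are useful, the types file we hand to TypeTokenFilterFactory
is just StandardTokenizer's type strings, one per line, roughly like this (the
file name itself is arbitrary):

  <HANGUL>
  <HIRAGANA>
  <IDEOGRAPHIC>
  <KATAKANA>
  <SOUTHEAST_ASIAN>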
Diego Fernandez
Hi Diego,
Did you miss Shawn's response? His ICUTokenizerFactory solution is better than
mine.
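If it helps, switching is just a tokenizer swap in the field type, something
like this (a rough sketch, not Shawn's exact configuration; the ICU jars from
the analysis-extras contrib need to be loaded, e.g. via <lib> in solrconfig.xml):

  <fieldType name="text_icu" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.ICUTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>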
By the way, what solr version are you using? Does StandardTokenizer set type
attribute for CJK words?
To filter out given types, you do not need a custom filter. TypeTokenFilter
serves exactly that purpose.
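Something like this in the analyzer chain would do it (the types file name is
just an example; with useWhitelist="false" the listed types are removed, with
"true" only those types are kept):

  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.TypeTokenFilterFactory" types="cjk_types.txt"
            useWhitelist="false"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>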
Great, thanks for the information! Right now we're using the StandardTokenizer
types to filter out CJK characters with a custom filter. I'll test using
MappingCharFilters, although I'm a little concerned about possible adverse
scenarios.
Diego Fernandez - 爱国
Software Engineer
US GSS Support
On 5/16/2014 9:24 AM, aiguofer wrote:
> Jack Krupansky-2 wrote
>> Typically the white space tokenizer is the best choice when the word
>> delimiter filter will be used.
>>
>> -- Jack Krupansky
>
> If we wanted to keep the StandardTokenizer (because we make use of the token
> types) but wanted to
Hi Aiguofer,
You mean ClassicTokenizer? Because StandardTokenizer does not set token types
(e-mail, url, etc.).
I wouldn't go with the JFlex edit, mainly because of the maintenance costs. It
would be a burden to maintain a custom tokenizer.
MappingCharFilters could be used to manipulate tokenizer behavior.
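For example (just a sketch; the mapping file name and the rule are only an
illustration): mapping the hyphen to an underscore in front of StandardTokenizer
should keep "wi_fi" as a single token, which WordDelimiterFilter can then split
and catenate:

  <analyzer>
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-hyphen.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" catenateWords="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

with mapping-hyphen.txt containing something like:

  "-" => "_"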
Jack Krupansky-2 wrote
> Typically the white space tokenizer is the best choice when the word
> delimiter filter will be used.
>
> -- Jack Krupansky
If we wanted to keep the StandardTokenizer (because we make use of the token
types) but wanted to use the WDFF to get combinations of words that ar
Typically the white space tokenizer is the best choice when the word
delimiter filter will be used.
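For example, a sketch (tune the flags to your needs):

  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" generateNumberParts="1"
            catenateWords="1" catenateNumbers="1"
            splitOnCaseChange="1"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>

With the whitespace tokenizer, "wi-fi" reaches the word delimiter filter as a
single token, so it can emit "wi", "fi", and "wifi".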
-- Jack Krupansky
-----Original Message-----
From: Shawn Heisey
Sent: Wednesday, April 16, 2014 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: WordDelimiterFilterFactory and
On 4/16/2014 8:37 PM, Bob Laferriere wrote:
>> I am seeing odd behavior from WordDelimiterFilterFactory (WDFF) when
>> used in conjunction with StandardTokenizerFactory (STF).
>> I see the following results for the document “wi-fi”:
>>
>> Index: “wi”, “fi”
>> Query: “wi”, “fi”, “wifi”
>>
>