Re: WordDelimiterFilterFactory and StandardTokenizer

2014-05-20 Thread Diego Fernandez
Hey Ahmet,

Yeah, I had missed Shawn's response; I'll have to give that a try as well. As for the version, we're using 4.4. StandardTokenizer sets the type attribute for HANGUL, HIRAGANA, IDEOGRAPHIC, KATAKANA, and SOUTHEAST_ASIAN tokens, and you're right, we're using TypeTokenFilter to remove those.

Diego Fernandez
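(For reference, a minimal sketch of the chain Diego describes: Solr 4.4, StandardTokenizer followed by TypeTokenFilterFactory. The field type name and the cjk-types.txt file name are made up for illustration.)

  <fieldType name="text_no_cjk" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.StandardTokenizerFactory"/>
      <!-- drop tokens whose type attribute matches an entry in cjk-types.txt -->
      <filter class="solr.TypeTokenFilterFactory" types="cjk-types.txt" useWhitelist="false"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>

cjk-types.txt would then list the StandardTokenizer type names to remove, one per line:

  <HANGUL>
  <HIRAGANA>
  <IDEOGRAPHIC>
  <KATAKANA>
  <SOUTHEAST_ASIAN>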

Re: WordDelimiterFilterFactory and StandardTokenizer

2014-05-20 Thread Ahmet Arslan
Hi Diego,

Did you miss Shawn's response? His ICUTokenizerFactory solution is better than mine. By the way, what Solr version are you using? Does StandardTokenizer set the type attribute for CJK words? To filter out given types you do not need a custom filter; TypeTokenFilter serves exactly that purpose.
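(Shawn's ICUTokenizerFactory suggestion is not reproduced in full in this archive, but the factory is typically used with per-script break rules so that, for example, hyphenated Latin tokens survive tokenization and reach later filters intact. A rough sketch, assuming the analysis-extras contrib and the ICU jars are on the classpath; the .rbbi file name below is a placeholder for a custom rule set, not a file that ships with Solr:)

  <tokenizer class="solr.ICUTokenizerFactory"
             rulefiles="Latn:Latin-break-only-on-whitespace.rbbi"/>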

Re: WordDelimiterFilterFactory and StandardTokenizer

2014-05-20 Thread Diego Fernandez
Great, thanks for the information! Right now we're using the StandardTokenizer types to filter out CJK characters with a custom filter. I'll test using MappingCharFilters, although I'm a little concerned about possible adverse scenarios.

Diego Fernandez - 爱国
Software Engineer
US GSS Support

Re: WordDelimiterFilterFactory and StandardTokenizer

2014-05-16 Thread Shawn Heisey
On 5/16/2014 9:24 AM, aiguofer wrote:
> Jack Krupansky-2 wrote:
>> Typically the white space tokenizer is the best choice when the word
>> delimiter filter will be used.
>>
>> -- Jack Krupansky
>
> If we wanted to keep the StandardTokenizer (because we make use of the
> token types) but wanted to use the WDFF to get combinations of words
> that are

Re: WordDelimiterFilterFactory and StandardTokenizer

2014-05-16 Thread Ahmet Arslan
Hi Aiguofer,

You mean ClassicTokenizer? Because StandardTokenizer does not set token types (e-mail, URL, etc.). I wouldn't go with the JFlex edit, mainly because of the maintenance cost: it would be a burden to maintain a custom tokenizer. MappingCharFilters could be used to manipulate tokenizer behavior.
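(A rough sketch of the MappingCharFilter idea: the char filter rewrites the raw character stream before the tokenizer runs, so it can keep StandardTokenizer from splitting where you don't want it to, or normalize characters away entirely. The file name and mapping rules below are only illustrative.)

  <analyzer>
    <!-- applied to the raw text before tokenization -->
    <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-chars.txt"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
  </analyzer>

where mapping-chars.txt might contain a rule such as:

  "-" => ""

which would make "wi-fi" come through as the single token "wifi"; mapping the hyphen to a space instead would force a split at that position.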

Re: WordDelimiterFilterFactory and StandardTokenizer

2014-05-16 Thread aiguofer
Jack Krupansky-2 wrote:
> Typically the white space tokenizer is the best choice when the word
> delimiter filter will be used.
>
> -- Jack Krupansky

If we wanted to keep the StandardTokenizer (because we make use of the token types) but wanted to use the WDFF to get combinations of words that are

Re: WordDelimiterFilterFactory and StandardTokenizer

2014-04-16 Thread Jack Krupansky
Typically the white space tokenizer is the best choice when the word delimiter filter will be used.

-- Jack Krupansky

-----Original Message-----
From: Shawn Heisey
Sent: Wednesday, April 16, 2014 11:03 PM
To: solr-user@lucene.apache.org
Subject: Re: WordDelimiterFilterFactory and StandardTokenizer
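(For concreteness, here is what Jack's recommendation usually looks like in a schema. This sketch mirrors the stock text_en_splitting example that ships with Solr; the exact WDFF flags are one reasonable choice, not the only one. Because WhitespaceTokenizer leaves "wi-fi" as a single token, WDFF can both split it and catenate it.)

  <fieldType name="text_ws_wdff" class="solr.TextField" positionIncrementGap="100">
    <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <!-- "wi-fi" arrives intact, so WDFF can emit "wi", "fi", and "wifi" -->
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="1" catenateNumbers="1" catenateAll="0"
              splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
    <analyzer type="query">
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
              generateWordParts="1" generateNumberParts="1"
              catenateWords="0" catenateNumbers="0" catenateAll="0"
              splitOnCaseChange="1"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>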

Re: WordDelimiterFilterFactory and StandardTokenizer

2014-04-16 Thread Shawn Heisey
On 4/16/2014 8:37 PM, Bob Laferriere wrote:
>> I am seeing odd behavior from WordDelimiterFilterFactory (WDFF) when
>> used in conjunction with StandardTokenizerFactory (STF). I see the
>> following results for the document “wi-fi”:
>>
>> Index: “wi”, “fi”
>> Query: “wi”, “fi”, “wifi”
>>
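(One detail worth keeping in mind when reading this thread: StandardTokenizer splits "wi-fi" on the hyphen before WordDelimiterFilter ever runs, so WDFF's catenate options have nothing left to join at that stage. A minimal chain to inspect in the admin Analysis screen, with illustrative settings only:)

  <analyzer>
    <!-- StandardTokenizer already emits "wi" and "fi" for "wi-fi",
         so the filter below never sees the hyphenated token -->
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.WordDelimiterFilterFactory"
            generateWordParts="1" catenateWords="1"/>
  </analyzer>

This is why the thread converges on either a WhitespaceTokenizer or an ICUTokenizer with custom rules whenever WDFF needs to see the raw hyphenated form.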