Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-12-01 Thread AHMET ARSLAN
> Hi Ahmet, After thinking about what Shai brought up, I changed my mind and think it is not good enough that we only have Collation as a way to solve this. Because you might want Turkish stemming too, and right now there is no way for the included snowball Turkish stemmer to work.

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-12-01 Thread Robert Muir
Hi Ahmet, After thinking about what Shai brought up, I changed my mind and think it is not good enough that we only have Collation as a way to solve this. Because you might want Turkish stemming too, and right now there is no way for the included snowball Turkish stemmer to work. I really do not l

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
On Mon, Nov 30, 2009 at 4:07 PM, Shai Erera wrote:
> Thanks again, I'll use this table as well.
You should only use it if you are normalizing to NFKC or NFKD afterwards...
> What I do is read those tables and store in a char[], for fast lookups of folding chars. I noticed your comments in

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
Thanks again, I'll use this table as well. What I do is read those tables and store in a char[], for fast lookups of folding chars. I noticed your comments in the code about not doing so because then the tables would need to be updated once in a while, and I agree. But ICU's lack of char[] API drov
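A minimal sketch of the char[] lookup idea described above, for illustration only. The class name, the method names and the two hard-coded mappings are hypothetical; a real table would be loaded from the Unicode data files, and a single char[] cannot hold one-to-many foldings such as ß to "SS".

```java
public class FoldingTableSketch {
    // One slot per UTF-16 code unit; index = input char, value = folded char.
    private static final char[] TABLE = buildTable();

    private static char[] buildTable() {
        char[] t = new char[Character.MAX_VALUE + 1]; // 65536 entries
        for (int i = 0; i < t.length; i++) {
            t[i] = (char) i; // identity by default
        }
        // Illustrative entries only (assumption): strip the breve from G.
        t['\u011E'] = 'G'; // LATIN CAPITAL LETTER G WITH BREVE
        t['\u011F'] = 'g'; // LATIN SMALL LETTER G WITH BREVE
        return t;
    }

    // Folds the first 'length' chars of the buffer in place, one lookup per char.
    static void foldInPlace(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = TABLE[buffer[i]];
        }
    }

    // Convenience wrapper for strings.
    static String fold(String s) {
        char[] b = s.toCharArray();
        foldInPlace(b, b.length);
        return new String(b);
    }

    public static void main(String[] args) {
        System.out.println(fold("A\u011EACIN")); // AGACIN
    }
}
```

The trade-off mentioned in the message still applies: a baked-in table has to be regenerated whenever the Unicode data changes.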

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
Shai, no, behind the scenes I am using just that table, via ICU library. The only reason the CaseFoldingFilter in my patch is more complex, is because I also apply FC_NFKC_Closure mappings. You can apply these tables in your impl too if you are also using normalization, they are here: http://unico

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
Thanks Robert. In my Analyzer I do case folding according to Unicode tables. So ß is converted to "SS". I do the same for diacritic removal and Hiragana/Katakana folding. I then apply a LowerCaseFilter, which gets the "SS" to "ss". I checked the filter's output on "AĞACIN" and it's "AGACIN". If I t

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera wrote:
> Robert, what if I need to do additional filtering after CollationKeyFilter, like stopwords removal, abbreviations handling, stemming etc? Will that be possible if I use CollationKeyFilter?
Shai, great point. This won't work with Collati

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
Shai, again the problem is not really performance (I am ignoring that for now), but the fact that lowercasing and case folding are different. An easy example: the lowercase of ß is ß itself, since it is already lowercase. It will not match 'SS' if you use a lowercase filter. if you use case folding,
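The ß point can be checked with plain JDK string methods. This is a sketch under stated assumptions: the class and method names are made up for illustration, and `approxFold` (uppercase, then lowercase) is only a rough stand-in for real Unicode case folding, which the JDK does not expose directly.

```java
import java.util.Locale;

public class SharpSDemo {
    // Compare two strings after plain lowercasing.
    static boolean matchByLowercase(String a, String b) {
        return a.toLowerCase(Locale.ROOT).equals(b.toLowerCase(Locale.ROOT));
    }

    // Rough approximation of case folding (assumption, not CaseFolding.txt):
    // uppercasing expands \u00DF to "SS", lowercasing then yields "ss".
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    static boolean matchByApproxFold(String a, String b) {
        return approxFold(a).equals(approxFold(b));
    }

    public static void main(String[] args) {
        String lower = "stra\u00DFe"; // "straße"
        String upper = "STRASSE";
        System.out.println(matchByLowercase(lower, upper));  // false
        System.out.println(matchByApproxFold(lower, upper)); // true
    }
}
```

Lowercasing leaves ß untouched, so the two spellings never meet; folding maps both to "strasse".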

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Shai Erera
Robert, what if I need to do additional filtering after CollationKeyFilter, like stopwords removal, abbreviations handling, stemming etc? Will that be possible if I use CollationKeyFilter? I also noticed CKF creates a String out of the char[]. If the code already does that, why not use String.toLo

RE: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Uwe Schindler
Hi Simon,
> > and RussianLowerCaseFilter is deprecated now, it does the exact same thing as LowerCaseFilter.
> btw. we should fix supplementary chars in there too even if it is deprecated.
Deprecated classes should never change and for sure not add Version ctors! If somebody wants to use

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Simon Willnauer
On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir wrote:
>> I am not sure if it is worth to add a new TokenFilter for Turkish language. I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would be nice to see TurkishLowerCaseFilter in Lucene.
> just to clarify, GreekLow

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
Yes, this is what I would do! The downside to using collation in your filter chain right now is that your terms in the index will not be human-readable. The upside is they will both sort and search the way your users expect for a huge list of languages. On Mon, Nov 30, 2009 at 2:22 PM, AHMET

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread AHMET ARSLAN
> just to clarify, GreekLowerCaseFilter really shouldn't exist either. The final sigma problem it has (where there are two lowercase forms depending upon position in word), this is also solved with Unicode case folding or collation. This is a perfect example of how lowercase is the wr

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
> I am not sure if it is worth to add a new TokenFilter for Turkish language. I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would be nice to see TurkishLowerCaseFilter in Lucene.
just to clarify, GreekLowerCaseFilter really shouldn't exist either. The final sigma p
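To illustrate the final sigma issue: Greek has two lowercase sigmas (σ inside a word, ς at the end) for the one capital Σ. The sketch below is hypothetical code, not the GreekLowerCaseFilter implementation. It lowercases per char and then maps ς to σ, which is the effect case folding has on sigma. It handles BMP characters only.

```java
public class FinalSigmaDemo {
    static final char SMALL_SIGMA = '\u03C3'; // σ, used inside a word
    static final char FINAL_SIGMA = '\u03C2'; // ς, used at the end of a word

    // Lowercase each char, then collapse the final-sigma form onto the
    // regular small sigma so both spellings produce the same term.
    static String foldSigma(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = Character.toLowerCase(s.charAt(i));
            sb.append(c == FINAL_SIGMA ? SMALL_SIGMA : c);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String upper = "\u039F\u0394\u039F\u03A3"; // "ΟΔΟΣ"
        String lower = "\u03BF\u03B4\u03BF\u03C2"; // "οδος" with final sigma
        // Both inputs fold to the same sequence, so they match in the index.
        System.out.println(foldSigma(upper).equals(foldSigma(lower))); // true
    }
}
```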

Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

2009-11-30 Thread Robert Muir
Hello, there is already an issue for this. The basics are that lowercasing with a locale is still not right, because it is intended for presentation (display), not for case folding. The problem is case folding is not exposed in the JDK, and you have to use the alternate "turkish/azeri" mappings an
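For reference, the behaviour behind the subject line can be reproduced with the JDK alone. This is a small sketch with hypothetical class and method names; it only shows locale-sensitive lowercasing, which, as noted above, is meant for display and is not the same thing as case folding.

```java
import java.util.Locale;

public class TurkishLowercaseDemo {
    private static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Locale-neutral lowercasing: 'I' becomes dotted 'i'.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Turkish lowercasing: 'I' becomes dotless '\u0131'.
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    public static void main(String[] args) {
        String word = "A\u011EACIN"; // "AĞACIN"
        System.out.println(lowerRoot(word));    // ağacin (dotted i)
        System.out.println(lowerTurkish(word)); // ağacın (dotless ı)
    }
}
```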