Thanks again, I'll use this table as well. What I do is read those tables and store them in a char[], for fast lookups of folding chars. I noticed your comments in the code about not doing so because then the tables would need to be updated once in a while, and I agree. But ICU's lack of a char[] API drove me away from it. I've had bad experiences with it, performance-wise, in the past.
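
(For illustration, a minimal sketch of the kind of char[] lookup described above. The class name CharFoldTable and its methods are hypothetical, not part of Lucene or ICU, and the sketch assumes one-to-one mappings inside the BMP only. One-to-many mappings such as ß to "SS", and supplementary characters, would need a separate structure.)

```java
// Hypothetical helper, not part of Lucene or ICU: a BMP-only fold table
// backed by a char[] so that each lookup is a single array read.
final class CharFoldTable {
    // One slot per UTF-16 code unit (65536 entries).
    private final char[] table = new char[Character.MAX_VALUE + 1];

    CharFoldTable() {
        // Identity mapping by default: every char folds to itself.
        for (int i = 0; i < table.length; i++) {
            table[i] = (char) i;
        }
    }

    // Register a one-to-one mapping, e.g. one parsed from a Unicode data file.
    void put(char from, char to) {
        table[from] = to;
    }

    // Fold a region of a buffer in place, without creating a String.
    void fold(char[] buffer, int offset, int length) {
        for (int i = offset; i < offset + length; i++) {
            buffer[i] = table[buffer[i]];
        }
    }
}
```

With a single entry registered for U+011E (Ğ) mapping to 'G', folding the buffer for "AĞACIN" in place yields "AGACIN". The trade-off raised in the thread still applies: the table contents are only as current as the Unicode files they were loaded from.
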
I even compared Java's Collator to ICU's, and Java's seemed to perform faster to me, although that wasn't a real performance test. But ICU seems to be more accurate than Java's (which is annoying). I figured that I can apply some rules on my own, but the more I read about contrib/analyzers, contrib/collation, LUCENE-1488 and this thread, the more I understand that "on my own" means staying alert to a lot of stuff I'm not aware of today :).

Two comments about the patch in LUCENE-1488. In some places you use StringBuffer and in others StringBuilder. Is that intentional? If not, I think you should move to StringBuilder. Also, in ICUCaseFoldingFilter, I believe termAtt can be declared final?

Thanks,
Shai

On Mon, Nov 30, 2009 at 10:46 PM, Robert Muir <rcm...@gmail.com> wrote:

> Shai, no, behind the scenes I am using just that table, via ICU library.
>
> The only reason the CaseFoldingFilter in my patch is more complex, is
> because I also apply FC_NFKC_Closure mappings.
> You can apply these tables in your impl too if you are also using
> normalization, they are here:
> http://unicode.org/Public/UNIDATA/DerivedNormalizationProps.txt
>
> The reasoning for this, is that if you are also normalizing to form NFKC or
> NFKD, you would have to do NFKC(Fold(NFKC(Fold(x)))) or
> NFKD(Fold(NFKD(Fold(x)))).
>
> with the mappings instead you can just do NFKC(Fold_w_closure(x)) and
> NFKD(Fold_w_closure(x)), and avoid double normalization and folding for
> better performance.
>
> On Mon, Nov 30, 2009 at 3:41 PM, Shai Erera <ser...@gmail.com> wrote:
>
> > Thanks Robert. In my Analyzer I do case folding according to Unicode
> > tables.
> > So ß is converted to "SS". I do the same for diacritic removal and
> > Hiragana/Katakana folding. I then apply a LowerCaseFilter, which gets the
> > "SS" to "ss".
> >
> > I checked the filter's output on "AĞACIN" and it's "AGACIN". If I
> > toLowerCase(new Locale("tr")), it's lowered to "agacın", which is correct.
> > Of course, LowerCaseFilter does not do that, I used String's.
> >
> > I just realized I've included lots of folding tables, except for
> > http://unicode.org/Public/UNIDATA/CaseFolding.txt. I guess I counted on
> > LowerCaseFilter too much. Is that the table you're working w/ in
> > LUCENE-1488? I assume you use more of course :)
> >
> > Shai
> >
> > On Mon, Nov 30, 2009 at 10:00 PM, Robert Muir <rcm...@gmail.com> wrote:
> >
> > > Shai, again the problem is not really performance (I am ignoring that for
> > > now), but the fact that lowercasing and case folding are different.
> > >
> > > An easy example, the lowercase of ß is ß itself, it is already lowercase.
> > > it will not match with 'SS' if you use lowercase filter.
> > >
> > > if you use case folding, these two will match.
> > >
> > > On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera <ser...@gmail.com> wrote:
> > >
> > > > Robert, what if I need to do additional filtering after
> > > > CollationKeyFilter, like stopwords removal, abbreviations handling,
> > > > stemming etc? Will that be possible if I use CollationKeyFilter?
> > > >
> > > > I also noticed CKF creates a String out of the char[]. If the code
> > > > already does that, why not use String.toLowerCase(Locale)?
> > > >
> > > > Shai
> > > >
> > > > On Mon, Nov 30, 2009 at 9:46 PM, Simon Willnauer <
> > > > simon.willna...@googlemail.com> wrote:
> > > >
> > > > > On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir <rcm...@gmail.com> wrote:
> > > > > >> I am not sure if it is worth to add a new TokenFilter for Turkish
> > > > > >> language.
> > > > > >> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter.
> > > > > >> It would be nice to see TurkishLowerCaseFilter in Lucene.
> > > > > >
> > > > > > just to clarify, GreekLowerCaseFilter really shouldn't exist either.
> > > > > > The final sigma problem it has (where there are two lowercase forms
> > > > > > depending upon position in word), this is also solved with unicode
> > > > > > case folding or collation. This is a perfect example of how
> > > > > > lowercase is the wrong operation for search.
> > > > > >
> > > > > > and RussianLowerCaseFilter is deprecated now, it does the exact same
> > > > > > thing as LowerCaseFilter.
> > > > > btw. we should fix supplementary chars in there too even if it is
> > > > > deprecated.
> > > > >
> > > > > > --
> > > > > > Robert Muir
> > > > > > rcm...@gmail.com
> > >
> > > --
> > > Robert Muir
> > > rcm...@gmail.com
>
> --
> Robert Muir
> rcm...@gmail.com
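
The distinctions discussed in this thread (lowercasing versus case folding, locale-sensitive lowercasing for Turkish, and the repeated fold-and-normalize pass) can be tried out with the JDK alone. The sketch below is illustrative only. The class CaseDemo and its methods are hypothetical, and approxFold is a rough stand-in (uppercase, then lowercase) rather than real Unicode case folding as defined by CaseFolding.txt. It also does not apply the FC_NFKC_Closure mappings, so it shows the double pass, not the single-pass Fold_w_closure shortcut.

```java
import java.text.Normalizer;
import java.util.Locale;

// Hypothetical demo class; uses only java.lang and java.text APIs.
final class CaseDemo {
    // Approximation of case folding, assumed for illustration only:
    // uppercase first so that U+00DF (sharp s) expands to "SS", then lowercase.
    static String approxFold(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    // The double pass from the thread, NFKC(Fold(NFKC(Fold(x)))),
    // written with the approximate fold above.
    static String nfkcFold(String x) {
        String once = Normalizer.normalize(approxFold(x), Normalizer.Form.NFKC);
        return Normalizer.normalize(approxFold(once), Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        String sharpS = "\u00DF";

        // Plain lowercasing leaves sharp s alone, so it does not match "SS".
        System.out.println(
            sharpS.toLowerCase(Locale.ROOT).equals("SS".toLowerCase(Locale.ROOT))); // false

        // With the approximate fold, both sides become "ss".
        System.out.println(approxFold(sharpS).equals(approxFold("SS"))); // true

        // Turkish lowercasing maps 'I' to dotless i (U+0131); ROOT does not.
        Locale turkish = Locale.forLanguageTag("tr");
        System.out.println("AGACIN".toLowerCase(turkish).equals("agac\u0131n")); // true
        System.out.println("AGACIN".toLowerCase(Locale.ROOT).equals("agacin"));  // true
    }
}
```

For real case folding, ICU is the better choice, as the thread suggests. The JDK has no direct equivalent, which is why the approximation above is labeled as such.
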