Re: LowerCaseFilter fails one letter (I) of Turkish alphabet

Robert Muir Mon, 30 Nov 2009 12:01:41 -0800

Shai, again the problem is not really performance (I am ignoring that for
now), but the fact that lowercasing and case folding are different.


An easy example: the lowercase of ß is ß itself, since it is already lowercase.
So it will not match 'SS' if you use a lowercase filter.

If you use case folding, these two will match.
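
To make the difference concrete, here is a minimal sketch using only the JDK. The class and method names (FoldingSketch, lower, foldApprox) are made up for illustration and are not Lucene APIs. The JDK has no case folding call, so foldApprox only approximates it by uppercasing and then lowercasing. That is enough to show the ß/SS case, but it is not the real Unicode case folding algorithm.

```java
import java.util.Locale;

// Hypothetical helper class, for illustration only; not part of Lucene.
class FoldingSketch {

    // Plain lowercasing, roughly what a lowercase filter does to each token.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // Rough approximation of full case folding using only the JDK (an assumption
    // of this sketch): uppercase first, which expands ß to SS, then lowercase.
    // Real case folding follows the Unicode CaseFolding.txt data, which ICU implements.
    static String foldApprox(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String a = "Stra\u00DFe"; // "Straße"
        String b = "STRASSE";

        // Lowercasing leaves ß alone, so the two terms stay different.
        System.out.println(lower(a).equals(lower(b)));           // false

        // The folding approximation maps both to "strasse".
        System.out.println(foldApprox(a).equals(foldApprox(b))); // true
    }
}
```

With lowercasing alone, a query for STRASSE would miss a document containing Straße; with folding applied on both the index and query side, the two terms come out identical.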

On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera <ser...@gmail.com> wrote:

> Robert, what if I need to do additional filtering after CollationKeyFilter,
> like stopwords removal, abbreviations handling, stemming etc? Will that be
> possible if I use CollationKeyFilter?
>
> I also noticed CKF creates a String out of the char[]. If the code already
> does that, why not use String.toLowerCase(Locale)?
>
> Shai
>
> On Mon, Nov 30, 2009 at 9:46 PM, Simon Willnauer <
> simon.willna...@googlemail.com> wrote:
>
> > On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir <rcm...@gmail.com> wrote:
> > >> I am not sure if it is worth to add a new TokenFilter for Turkish language.
> > >> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would
> > >> be nice to see TurkishLowerCaseFilter in Lucene.
> > >>
> > > just to clarify, GreekLowerCaseFilter really shouldn't exist either. The
> > > final sigma problem it has (where there are two lowercase forms depending
> > > upon position in word), this is also solved with unicode case folding or
> > > collation. This is a perfect example of how lowercase is the wrong operation
> > > for search.
> > >
> > > and RussianLowerCaseFilter is deprecated now, it does the exact same thing
> > > as LowerCaseFilter.
> > btw. we should fix supplementary chars in there too even if it is
> > deprecated.
> >
> > >
> > > --
> > > Robert Muir
> > > rcm...@gmail.com
> > >
> >
> >
> >
>



-- 
Robert Muir
rcm...@gmail.com
