> Hi Ahmet,
>
> After thinking about what Shai brought up, I changed my mind and think it is
> not good enough that we only have Collation as a way to solve this.
> Because you might want Turkish stemming too, and right now there is no way
> for the included Snowball Turkish stemmer to work.
Hi Ahmet,
After thinking about what Shai brought up, I changed my mind and think it is
not good enough that we only have Collation as a way to solve this.
Because you might want Turkish stemming too, and right now there is no way
for the included Snowball Turkish stemmer to work.
I really do not l
On Mon, Nov 30, 2009 at 4:07 PM, Shai Erera wrote:
> Thanks again, I'll use this table as well.
You should only use it if you are normalizing to NFKC or NFKD afterwards...
> What I do is read those tables
> and store in a char[], for fast lookups of folding chars. I noticed your
> comments in
Thanks again, I'll use this table as well. What I do is read those tables
and store in a char[], for fast lookups of folding chars. I noticed your
comments in the code about not doing so because then the tables would need
to be updated once in a while, and I agree. But ICU's lack of char[] API
drov
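A minimal Java sketch of the char[] lookup approach Shai describes. The class name FoldTableSketch and its methods are hypothetical. The table here is filled from the JDK's simple per-character case mappings as a stand-in for the Unicode folding tables discussed in the thread, which are not reproduced.

```java
class FoldTableSketch {
    // One entry per UTF-16 code unit; built once, then used for O(1) lookups.
    static final char[] FOLD = buildTable();

    // ASSUMPTION: the JDK's simple upper/lower mappings stand in for the
    // real folding data; a production table would be loaded from Unicode files.
    static char[] buildTable() {
        char[] table = new char[Character.MAX_VALUE + 1];
        for (int c = 0; c <= Character.MAX_VALUE; c++) {
            table[c] = Character.toLowerCase(Character.toUpperCase((char) c));
        }
        return table;
    }

    // Folds the first `length` chars of a term buffer in place.
    static void foldInPlace(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            buffer[i] = FOLD[buffer[i]];
        }
    }

    public static void main(String[] args) {
        char[] term = "AbC".toCharArray();
        foldInPlace(term, term.length);
        System.out.println(new String(term));
    }
}
```

A char[] table of this shape can only hold one-to-one mappings. Expansions such as ß to "ss" need a separate path, and supplementary characters (surrogate pairs) are not covered by this sketch.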
Shai, no, behind the scenes I am using just that table, via the ICU library.
The only reason the CaseFoldingFilter in my patch is more complex is
that I also apply FC_NFKC_Closure mappings.
You can apply these tables in your impl too if you are also using
normalization, they are here:
http://unico
Thanks Robert. In my Analyzer I do case folding according to Unicode tables.
So ß is converted to "SS". I do the same for diacritic removal and
Hiragana/Katakana folding. I then apply a LowerCaseFilter, which gets the
"SS" to "ss".
I checked the filter's output on "AĞACIN" and it's "AGACIN". If I
t
On Mon, Nov 30, 2009 at 2:53 PM, Shai Erera wrote:
> Robert, what if I need to do additional filtering after CollationKeyFilter,
> like stopwords removal, abbreviations handling, stemming etc? Will that be
> possible if I use CollationKeyFilter?
>
>
Shai, great point. This won't work with Collati
Shai, again the problem is not really performance (I am ignoring that for
now), but the fact that lowercasing and case folding are different.
An easy example: the lowercase of ß is ß itself, since it is already lowercase.
It will not match 'SS' if you use a lowercase filter.
if you use case folding,
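The difference between lowercasing and case folding can be seen with plain JDK calls. This is a sketch, not Lucene or ICU code: the class CaseFoldSketch and its methods are hypothetical, and foldApprox (uppercase, then lowercase) is only a rough approximation of Unicode case folding, used here because the JDK does not expose case folding directly.

```java
import java.util.Locale;

class CaseFoldSketch {
    // Plain lowercasing, as a lowercase filter would apply to each term.
    static String lower(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    // ASSUMPTION: upper-then-lower is a stand-in for real case folding.
    // Uppercasing expands U+00DF to "SS"; lowercasing then yields "ss".
    static String foldApprox(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String sharpS = "\u00DF"; // the letter ß
        // Lowercasing leaves ß alone, so it cannot match lowercased "SS".
        System.out.println(lower(sharpS).equals(lower("SS")));
        // The approximate fold maps both spellings to the same string.
        System.out.println(foldApprox(sharpS).equals(foldApprox("SS")));
    }
}
```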
Robert, what if I need to do additional filtering after CollationKeyFilter,
like stopwords removal, abbreviations handling, stemming etc? Will that be
possible if I use CollationKeyFilter?
I also noticed CKF creates a String out of the char[]. If the code already
does that, why not use String.toLo
Hi Simon,
> > and RussianLowerCaseFilter is deprecated now, it does the exact same
> thing
> > as LowerCaseFilter.
> btw. we should fix supplementary chars in there too even if it is
> deprecated.
Deprecated classes should never change and for sure not add Version ctors!
If somebody wants to use
On Mon, Nov 30, 2009 at 8:08 PM, Robert Muir wrote:
>> I am not sure if it is worth adding a new TokenFilter for the Turkish language.
>> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would
>> be nice to see TurkishLowerCaseFilter in Lucene.
>>
>>
>>
> just to clarify, GreekLow
Yes, this is what I would do! The downside of using collation in your filter
chain right now is that your terms in the index will not be
human-readable. The upside is that they will both sort and search the way your
users expect for a huge list of languages.
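The trade-off can be illustrated with the JDK's own java.text.Collator. This is only a sketch: it does not use Lucene's CollationKeyFilter or ICU, and the class CollationKeySketch and its methods are hypothetical. It shows that a collation key is an opaque byte sequence, and that at primary strength case variants produce the same key.

```java
import java.text.Collator;
import java.util.Arrays;
import java.util.Locale;

class CollationKeySketch {
    // PRIMARY strength compares base letters only; case differences are ignored.
    static Collator primaryCollator(Locale locale) {
        Collator collator = Collator.getInstance(locale);
        collator.setStrength(Collator.PRIMARY);
        return collator;
    }

    // The key is binary data meant for comparison, not for display.
    static byte[] keyFor(Collator collator, String term) {
        return collator.getCollationKey(term).toByteArray();
    }

    public static void main(String[] args) {
        Collator collator = primaryCollator(Locale.ENGLISH);
        // Prints raw key bytes, which is why indexed terms stop being readable.
        System.out.println(Arrays.toString(keyFor(collator, "Lucene")));
        // Case variants share one key at primary strength.
        System.out.println(Arrays.equals(keyFor(collator, "lucene"),
                                         keyFor(collator, "LUCENE")));
    }
}
```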
On Mon, Nov 30, 2009 at 2:22 PM, AHMET
> just to clarify, GreekLowerCaseFilter really shouldn't exist either. The
> final sigma problem it has (where there are two lowercase forms depending
> upon position in word), this is also solved with unicode case folding or
> collation. This is a perfect example of how lowercase is the wr
> I am not sure if it is worth adding a new TokenFilter for the Turkish language.
> I see there exist GreekLowerCaseFilter and RussianLowerCaseFilter. It would
> be nice to see TurkishLowerCaseFilter in Lucene.
>
>
>
just to clarify, GreekLowerCaseFilter really shouldn't exist either. The
final sigma p
Hello, there is already an issue for this.
The basic problem is that lowercasing with a locale is still not right, because
it is intended for presentation (display), not for case folding.
The problem is that case folding is not exposed in the JDK, and you have to use
the alternate "turkish/azeri" mappings an
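For reference, this is what the JDK's locale-sensitive lowercasing does with the Turkish dotted and dotless i. It is a sketch of the presentation-oriented behavior described above, not a case-folding implementation. The class TurkishLowerSketch and its methods are hypothetical.

```java
import java.util.Locale;

class TurkishLowerSketch {
    static final Locale TURKISH = Locale.forLanguageTag("tr");

    // Locale-sensitive lowercasing: capital I becomes dotless i (U+0131).
    static String lowerTurkish(String s) {
        return s.toLowerCase(TURKISH);
    }

    // Locale-neutral lowercasing: capital I becomes plain ASCII i.
    static String lowerRoot(String s) {
        return s.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String word = "A\u011EACIN"; // the word from the thread, written with escapes
        // The two results differ in the second-to-last letter.
        System.out.println(lowerRoot(word).equals(lowerTurkish(word)));
    }
}
```

Because the two results differ, a term lowercased one way at index time will not match the same term lowercased the other way at query time.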