Hindi, diacritics and search results

OBender Fri, 10 Jul 2009 12:10:43 -0700

Hi All,


I'm using the default setup of lucene (no custom analyzers configured) and
came across the following issue:

In Hindi if there is a letter with a diacritic in a phrase lucene will find
the phrase with this letter even if the search string is for the letter
without a diacritics.

Is this an expected behavior? Maybe this is standard for all languages with
letters that have diacritics?

 

>From pure byte standpoint I can see the logic, the letter with diacritics
takes 6 bytes (E0 A4 95 E0 A5 87) and the single letter takes  3 (E0 A4 95)
so if I search for *some_letter* where some letter has code (E0 A4 95)
lucene finds the "phrase" (E0 A4 95 E0 A5 87) that includes that letter.

 

Any comments much appreciated.

 

Thanks.

Hindi, diacritics and search results

Reply via email to