Thank you very much Yonik. I downloaded the latest Solr build, pulled the WordDelimiterFilter and used it with the same option as used by Solr default and it worked like a charm. Thanks to Robert also.
Thanks, KK On Tue, Jun 9, 2009 at 7:01 PM, Yonik Seeley <yo...@lucidimagination.com>wrote: > I just cut'n'pasted your word into Solr... it worked fine (it didn't > split the word). > Make sure you're using the latest from the trunk version of Solr... > this was fixed since 1.3 > > http://localhost:8983/solr/select?q=साल&debugQuery=true > [...] > <lst name="debug"> > <str name="rawquerystring">साल</str> > <str name="querystring">साल</str> > <str name="parsedquery">text:साल</str> > <str name="parsedquery_toString">text:साल</str> > > -Yonik > > > On Tue, Jun 9, 2009 at 7:48 AM, KK <dioxide.softw...@gmail.com> wrote: > > Hi Robert, I tried a sample code to check whats the reason. The > > worddelimiterfilter uses isLetter() method to tokenize, and for hindi > words > > some parts of word are not actually letters but just part of the word[but > > that doesnot mean they can be used as word delimiters], since they are > not > > letters isLetter() returns false and the word is getting breaked around > > that. This is some sample code with a hindi word pronounced saal[meaning > > year in english], > > > > import java.lang.String; > > > > public class HindiUnicodeTest { > > public static void main(String args[]) { > > String hindiStr = "साल"; > > int length = hindiStr.length(); > > System.out.println("str length " + length); > > for (int i=0; i<length; i++) { > > System.out.println(hindiStr.charAt(i) + " is " + > > Character.isLetter(hindiStr.charAt(i))); > > } > > > > } > > } > > > > Running this gives this output, > > str length 3 > > स is true > > ा is false > > ल is true > > > > As you can see the second one is false, which says that it is not a > letter > > but this makes worddelimiterfilter break/tokenize around the word. I even > > tried to use my custom parser[which I mentioned earlier] and tried to > print > > the string that is the output after the query getting parsed, and what I > > found is that if I send the above hindi word then the query string after > > being parsed is something like this, > > Parsed Query string: स ल > > it essentialy removes the non-letter character[the second one], and it > seems > > it treats them as separate and whenever thse two characters appear > adjacent, > > they are in th top of result set, also whereever these two letters appers > in > > the doc, it says they are part of the result set [and hence highlights > > them]. > > > > I hope I made it clear. Do let me if some more information is required. > > > > Thanks, > > KK. > > --------------------------------------------------------------------- > To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org > For additional commands, e-mail: java-user-h...@lucene.apache.org > >