Thanks, Muir. I don't know what was going wrong, but I did a fresh build using SimpleAnalyzer and it's working. I also tried longer sentences, say up to 10-12 words, and it's fine. So far so good... I don't know anything about Unicode normalization; I'll google it and see if I can make use of it. That apart, I want a single indexer/searcher that can handle around 8-10 non-English [Indian] languages. So far, using SimpleAnalyzer, I've tried up to 3 languages for indexing/searching and it's working, but I don't know whether it will work for the other non-English Indian languages as well. Muir, if you have some pointers on doing Unicode normalization, please let me know. If you think it might help, I'd definitely give it a try.
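For context, here is roughly what my current indexing/searching setup looks like. This is a simplified, untested sketch against the Lucene 2.4-style API, not my real code; the field name and sample text are just placeholders.

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SimpleAnalyzerSketch {
    public static void main(String[] args) throws Exception {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        Directory dir = new RAMDirectory();

        // Index one UTF-8 document, analyzed with SimpleAnalyzer.
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("contents",
                "\u0938\u0941\u0939\u093E\u0928\u093E \u0938\u092B\u093C\u0930",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search with the same analyzer. Note the query string contains real
        // Unicode characters (Java compiles the \u escapes into characters),
        // not literal backslash-u text.
        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser("contents", analyzer);
        Query query = parser.parse("\u0938\u092B\u093C\u0930");
        TopDocs hits = searcher.search(query, 10);
        System.out.println("hits: " + hits.totalHits);
        searcher.close();
    }
}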
Thanks,
KK

On Thu, May 21, 2009 at 7:40 PM, Robert Muir <rcm...@gmail.com> wrote:
> Hello, your example (Hindi) is probably suffering from a number of search
> issues:
>
> I don't recommend StandardAnalyzer for this example, as it will break
> words around the dependent vowels and nukta dot, etc.
> WhitespaceAnalyzer might be a good start.
>
> Also, is it possible to apply Unicode normalization to your text before
> indexing it?
> Normalization will standardize things in Indian languages.
>
> In your example, the pha + nukta dot you queried on is the normalized
> form, but I wonder if in your text it's encoded as fa (U+095E).
> If you apply normalization mode NFC it will standardize to pha + nukta
> dot.
>
> On Thu, May 21, 2009 at 9:26 AM, KK <dioxide.softw...@gmail.com> wrote:
>
> > Hi All,
> > I've indexed some docs [non-English] in Unicode UTF-8 format. For both
> > indexing and searching/querying I'm using SimpleAnalyzer. For English
> > texts, when I tried single words it worked, so then I thought of trying
> > non-English texts. I wrote those words [multiple words] in BabelMap [a
> > Unicode converter], got the Unicode escapes for the text string, and
> > tried that as the query, but it didn't work. Earlier I've used the same
> > method to query a Solr index, which uses Lucene at the backend. I tried,
> > say, this query,
> > \u0938\u0941\u0939\u093E\u0928\u093E\u0020\u0938\u092B\u093C\u0930
> > which is the Unicode for some non-English text, but it gives me zero
> > search results in Lucene. I want to know what's going wrong. As I
> > understand it, at the end of the day Lucene writes my non-English texts
> > as Unicode, so if I read the index it will have this kind of characters
> > on disk, right? So when I query using the same thing it should work.
> > This used to work perfectly well with Solr, where I was indexing all
> > docs in Unicode UTF-8 encoding and the query was Unicode-escaped as
> > shown above. Can someone point out what is going wrong here?
> > Maybe I have to look at the analyzer Solr was using in its default
> > setting [I used the default setting only, and I'm pretty sure it uses
> > quite a few analyzers/filter factories]. Thanks for all your time.
> >
> > Thanks,
> > KK.
> >
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
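For reference, the NFC normalization Robert suggests above can be done with Java's built-in java.text.Normalizer (available since Java 6). A minimal sketch, using the code points mentioned in the thread; class and variable names are only illustrative:

import java.text.Normalizer;

public class NfcSketch {
    public static void main(String[] args) {
        // Precomposed form: DEVANAGARI LETTER FA as a single code point.
        String raw = "\u095E";

        // Under NFC, U+095E is a composition exclusion, so it normalizes to
        // the decomposed pair PHA (U+092B) + NUKTA (U+093C) -- the same form
        // as in the query string discussed above.
        String normalized = Normalizer.normalize(raw, Normalizer.Form.NFC);

        for (int i = 0; i < normalized.length(); i++) {
            System.out.printf("U+%04X%n", (int) normalized.charAt(i));
        }
        // prints U+092B and U+093C
    }
}

Presumably the same call would be applied to each document's text before it is handed to the IndexWriter, and to the query string as well, so that both sides end up in the same normalized form.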