Thanks, Muir. I don't know what was going wrong, but I did a fresh build using SimpleAnalyzer and it's working. I also tried longer sentences, say up to 10-12 words, and it's fine. So far so good... I don't know anything about Unicode normalization; I'll google it and see if I can make use of it. That apart, I want a single indexer/searcher that can handle around 8-10 non-English [Indian] languages. So far, using SimpleAnalyzer, I've tried up to 3 languages for indexing/searching and it's working, but I don't know whether it will work for the other non-English Indian languages as well. Muir, if you have some pointers on doing Unicode normalization, please let me know. If you think it might help, I'd definitely give it a try.
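For context, here is roughly what my current indexing/searching setup looks like. This is a simplified, untested sketch against the Lucene 2.4-style API, not my real code; the field name and sample text are just placeholders.

import org.apache.lucene.analysis.SimpleAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class SimpleAnalyzerSketch {
    public static void main(String[] args) throws Exception {
        SimpleAnalyzer analyzer = new SimpleAnalyzer();
        Directory dir = new RAMDirectory();

        // Index one UTF-8 document, analyzed with SimpleAnalyzer.
        IndexWriter writer = new IndexWriter(dir, analyzer, true,
                IndexWriter.MaxFieldLength.UNLIMITED);
        Document doc = new Document();
        doc.add(new Field("contents",
                "\u0938\u0941\u0939\u093E\u0928\u093E \u0938\u092B\u093C\u0930",
                Field.Store.YES, Field.Index.ANALYZED));
        writer.addDocument(doc);
        writer.close();

        // Search with the same analyzer. Note the query string contains real
        // Unicode characters (Java compiles the \u escapes into characters),
        // not literal backslash-u text.
        IndexSearcher searcher = new IndexSearcher(dir);
        QueryParser parser = new QueryParser("contents", analyzer);
        Query query = parser.parse("\u0938\u092B\u093C\u0930");
        TopDocs hits = searcher.search(query, 10);
        System.out.println("hits: " + hits.totalHits);
        searcher.close();
    }
}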
Thanks,
KK

On Thu, May 21, 2009 at 7:40 PM, Robert Muir <rcm...@gmail.com> wrote:
> Hello, your example (Hindi) is probably suffering from a number of search
> issues:
>
> I don't recommend StandardAnalyzer for this example, as it will break
> words around the dependent vowels and nukta dot, etc.
> WhitespaceAnalyzer might be a good start.
>
> Also, is it possible to apply Unicode normalization to your text before
> indexing it?
> Normalization will standardize things in Indian languages.
>
> In your example, the pha + nukta dot you queried on is the normalized
> form, but I wonder if in your text it's encoded as fa (U+095E).
> If you apply normalization mode NFC it will standardize to pha + nukta
> dot.
>
> On Thu, May 21, 2009 at 9:26 AM, KK <dioxide.softw...@gmail.com> wrote:
>
> > Hi All,
> > I've indexed some docs [non-English] in Unicode UTF-8 format. For both
> > indexing and searching/querying I'm using SimpleAnalyzer. For English
> > texts, when I tried single words it worked, so then I thought of trying
> > non-English texts. I wrote those words [multiple words] in BabelMap [a
> > Unicode converter], got the Unicode escapes for the text string, and
> > tried that as the query, but it didn't work. Earlier I've used the same
> > method to query a Solr index, which uses Lucene at the backend. I tried,
> > say, this query,
> > \u0938\u0941\u0939\u093E\u0928\u093E\u0020\u0938\u092B\u093C\u0930
> > which is the Unicode for some non-English text, but it gives me zero
> > search results in Lucene. I want to know what's going wrong. As I
> > understand it, at the end of the day Lucene writes my non-English texts
> > as Unicode, so if I read the index it will have this kind of characters
> > on disk, right? So when I query using the same thing it should work.
> > This used to work perfectly well with Solr, where I was indexing all
> > docs in Unicode UTF-8 encoding and the query was Unicode-escaped as
> > shown above. Can someone point out what is going wrong here?
> > Maybe I have to look at the analyzer Solr was using in its default
> > setting [I used the default setting only, and I'm pretty sure it uses
> > quite a few analyzers/filter factories]. Thanks for all your time.
> >
> > Thanks,
> > KK.
> >
>
>
> --
> Robert Muir
> rcm...@gmail.com
>
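For reference, the NFC normalization Robert suggests above can be done with Java's built-in java.text.Normalizer (available since Java 6). A minimal sketch, using the code points mentioned in the thread; class and variable names are only illustrative:

import java.text.Normalizer;

public class NfcSketch {
    public static void main(String[] args) {
        // Precomposed form: DEVANAGARI LETTER FA as a single code point.
        String raw = "\u095E";

        // Under NFC, U+095E is a composition exclusion, so it normalizes to
        // the decomposed pair PHA (U+092B) + NUKTA (U+093C) -- the same form
        // as in the query string discussed above.
        String normalized = Normalizer.normalize(raw, Normalizer.Form.NFC);

        for (int i = 0; i < normalized.length(); i++) {
            System.out.printf("U+%04X%n", (int) normalized.charAt(i));
        }
        // prints U+092B and U+093C
    }
}

Presumably the same call would be applied to each document's text before it is handed to the IndexWriter, and to the query string as well, so that both sides end up in the same normalized form.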