Re: Hit highlighting for non-english unicode index/queries not working?

Erick Erickson Tue, 26 May 2009 06:32:26 -0700

It's fairly easy to construct your own analyzer bystringing together some
filters and tokenizers. LIA (1st ed)
had a SynonymAnalyzer. You probably want something like
(WARNING, example only, I'm not even sure it compiles!! Ripped
off  from the WIKI)


public class MyAnalyzer extends Analyzer
{
    public TokenStream tokenStream (String field, final Reader reader) {
            return  new LowercaseFilter (new WhitespaceTokenizer(reader));
   }
}

There are a number of Filters you can string together if you want to, say,
remove stop words etc..

HTH
Erick

On Tue, May 26, 2009 at 6:38 AM, KK <dioxide.softw...@gmail.com> wrote:

> Thank you @Muir.
> I was earlier using simpleanalyzer for all purposes but as you reccomended
> me the whitespace one, I tried to use that analyzer and good thing is that
> I'm able to index/search non-english text as well as supporting hit
> highlighting for these non-english texts. Thank you very much.
> But now there is one silly problem. As whitespaceanalyzer doesnot do
> anything other than separating the tokens based on the space, for english
> pages case-folding is getting missed. Unless I provide the exact words
> including the right cases it doesnot give me results, which is quite
> obivious. As I went thru the LIA 2nd Edn book, found that it mentions we
> can
> use analyzers on document level and also on field level. I was quite amazed
> at the granularity of analysis supported by Lucene. But its there we just
> have to make use of it. So I'm thinking of giving it a try that will help
> me
> support  both english and non-english indexing/searching/highlighting.
> Thank
> you all. Any ideas on the same are always welcome.
>
> Thanks,
> KK.
>
>
> On Tue, May 26, 2009 at 1:24 AM, Robert Muir <rcm...@gmail.com> wrote:
>
> > as mentioned previously, i dont think your text is being analyzed the way
> > you want.
> >
> > SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > (பரிணாம) into 3 tokens:
> >
> > \u0BAA\u0BB0
> > \u0BA3
> > \u0BAE
> >
> > Not only does it incorrectly split your word into three words, but it
> > completely drops the dependent vowels (\u0BBF and \u0BBE).
> >
> > This is why i would recommend trying whitespace analyzer instead.
> > Also take a look at the Luke index tool, its a very quick way to see how
> > your words are being analyzed by various analyzers.
> >
> >
> > On Mon, May 25, 2009 at 10:02 AM, KK <dioxide.softw...@gmail.com> wrote:
> >
> > > Hi,
> > > I'm trying to index some non-english texts. Indexing and searching is
> > > working fine. From command line I'm able to provide the utf-8 unicoded
> > text
> > > as input like this,
> > > \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > > and able to get the search results.
> > > Then I tried to add hit highlighting for the same. So I started with
> > simple
> > > english texts and used pharse queries for providing input queries. My
> > code
> > > looks like this,
> > >
> > >
> > > import java.io.FileReader;
> > > import java.io.IOException;
> > > import java.io.InputStreamReader;
> > > import java.util.Date;
> > > import java.io.*;
> > > import java.nio.charset.Charset;
> > >
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.index.FilterIndexReader;
> > > import org.apache.lucene.index.IndexReader;
> > > import org.apache.lucene.index.Term;
> > > import org.apache.lucene.queryParser.QueryParser;
> > > import org.apache.lucene.search.HitCollector;
> > > import org.apache.lucene.search.Hits;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.search.Query;
> > > import org.apache.lucene.search.PhraseQuery;
> > > import org.apache.lucene.search.ScoreDoc;
> > > import org.apache.lucene.search.Searcher;
> > > import org.apache.lucene.search.TopDocCollector;
> > > import org.apache.lucene.search.highlight.Highlighter;
> > > import org.apache.lucene.search.highlight.QueryScorer;
> > > import org.apache.lucene.search.Scorer;
> > > import org.apache.lucene.analysis.TokenStream;
> > > import org.apache.lucene.analysis.SimpleAnalyzer;
> > >
> > >
> > > /** Simple command-line based search demo. */
> > > public class LuceneSearcher {
> > >    private static final String indexPath = "/opt/lucene/index" +
> > "/core36";
> > > //core36 refers to the exact index directory for tamil pages
> > >
> > >    private void searchIndex(String terms) throws Exception{
> > >        String queryString = "";
> > >        PhraseQuery phrase = new PhraseQuery();
> > >        String[] termArray = terms.split(" ");
> > >        for (int i=0; i<termArray.length; i++) {
> > >            System.out.println("adding " + termArray[i]);
> > >            //phrase.add(new Term("content", termArray[i]));
> > >            //queryString += termArray[i];
> > >        }
> > >        /
> > >        //phrase.add(new Term("content", "ubuntu"));
> > >        String tamilQuery = new
> > > String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0");
> > >        //tamilQuery = new String("ubuntu");
> > >        phrase.add(new Term("content", tamilQuery));
> > >        phrase.setSlop(1);
> > >        System.out.println("phrase query " + phrase.toString());
> > >
> > >         IndexSearcher searcher = new IndexSearcher(indexPath);
> > >        QueryParser queryParser = null;
> > >        try {
> > >            queryParser = new QueryParser("content", new
> > SimpleAnalyzer());
> > >        } catch (Exception ex) {
> > >             ex.printStackTrace();
> > >        }
> > >
> > >        //Query query = queryParser.parse(queryString);
> > >
> > >        Hits hits = null;
> > >        try {
> > >             hits = searcher.search(phrase);
> > >        } catch (Exception ex) {
> > >             ex.printStackTrace();
> > >        }
> > >        //for highlighter section
> > >        QueryScorer scorer = new QueryScorer(phrase);
> > >        Highlighter highlighter = new Highlighter(scorer);
> > >
> > >        for (int i = 0; i < hits.length(); i++) {
> > >            String content = hits.doc(i).get("content");
> > >            TokenStream stream = new
> > SimpleAnalyzer().tokenStream("content",
> > > new StringReader(content));
> > >            String fragment = highlighter.getBestFragments(stream,
> > content,
> > > 5, "...");
> > >            System.out.println(fragment);
> > >        }
> > >
> > >
> > >        int hitCount = hits.length();
> > >        System.out.println("Results found :" + hitCount);
> > >
> > >        /*
> > >        for (int ix=0; ix<hitCount; ix++) {
> > >             Document doc = hits.doc(ix);
> > >            System.out.println(doc.get("content"));
> > >        }
> > >        */
> > >    }
> > >
> > >    public static void main(String args[]) throws Exception{
> > >         LuceneSearcher searcher = new LuceneSearcher();
> > >        String termString = args[0];
> > >        System.out.println("searching for " + args[0]);
> > >        searcher.searchIndex(termString);
> > >    }
> > >
> > > }
> > > ----------------------code ends here---------------------------------
> > > NB: Please ignore basic coding conventio[ indentations, comments etc].
> > You
> > > might find some unneccesary code intermixed with the highlighting code,
> > > ignore them .
> > >
> > > Now when I searched for some english docs I got the results with
> <b></b>
> > > tags sorrounding the hits like this,
> > >
> > > <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
> > > <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security
> > notices
> > > that affect the current supported releases of <B>Ubuntu</B>. These
> > notices
> > > are also posted
> > >
> > > Now I thought of testing the same for temil texts. Before this I would
> > like
> > > to add one more information that prior to adding the codes for
> > highlighting
> > > I was able to search a lucene index from the command line using the raw
> > > unicode texts like this,
> > > [...@kk-laptop]$ java LuceneSearcher
> > "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"
> > >
> > > and it gives me the page that mathces the above query. Now I tried to
> do
> > > the
> > > same alongwith highliting. So in the code I posted above you can see
> that
> > I
> > > commented out the english terms and added one tamil unicode query and
> > tried
> > > to see If it gives me the same result that I was getting prior to
> > > highlighting and found that I'm not getting any results. This might be
> > > because the query I'm forming using these unicode texts is wrong, or
> may
> > be
> > > something else. I'm not able to figure out what exactly is going wrong?
> > > Some
> > > silly mistake I guess, still I'm not able to find out. Can some one
> take
> > > the
> > > pain to go throgh the above code and find out whats wrong. Thank you
> very
> > > much.
> > >
> > > Thanks,
> > > KK.
> > >
> >
> >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
>

Re: Hit highlighting for non-english unicode index/queries not working?

Reply via email to