Thank you @Muir. I was earlier using simpleanalyzer for all purposes but as you reccomended me the whitespace one, I tried to use that analyzer and good thing is that I'm able to index/search non-english text as well as supporting hit highlighting for these non-english texts. Thank you very much. But now there is one silly problem. As whitespaceanalyzer doesnot do anything other than separating the tokens based on the space, for english pages case-folding is getting missed. Unless I provide the exact words including the right cases it doesnot give me results, which is quite obivious. As I went thru the LIA 2nd Edn book, found that it mentions we can use analyzers on document level and also on field level. I was quite amazed at the granularity of analysis supported by Lucene. But its there we just have to make use of it. So I'm thinking of giving it a try that will help me support both english and non-english indexing/searching/highlighting. Thank you all. Any ideas on the same are always welcome.
Thanks, KK. On Tue, May 26, 2009 at 1:24 AM, Robert Muir <rcm...@gmail.com> wrote: > as mentioned previously, i dont think your text is being analyzed the way > you want. > > SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE > (பரிணாம) into 3 tokens: > > \u0BAA\u0BB0 > \u0BA3 > \u0BAE > > Not only does it incorrectly split your word into three words, but it > completely drops the dependent vowels (\u0BBF and \u0BBE). > > This is why i would recommend trying whitespace analyzer instead. > Also take a look at the Luke index tool, its a very quick way to see how > your words are being analyzed by various analyzers. > > > On Mon, May 25, 2009 at 10:02 AM, KK <dioxide.softw...@gmail.com> wrote: > > > Hi, > > I'm trying to index some non-english texts. Indexing and searching is > > working fine. From command line I'm able to provide the utf-8 unicoded > text > > as input like this, > > \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE > > and able to get the search results. > > Then I tried to add hit highlighting for the same. So I started with > simple > > english texts and used pharse queries for providing input queries. My > code > > looks like this, > > > > > > import java.io.FileReader; > > import java.io.IOException; > > import java.io.InputStreamReader; > > import java.util.Date; > > import java.io.*; > > import java.nio.charset.Charset; > > > > import org.apache.lucene.analysis.Analyzer; > > import org.apache.lucene.analysis.standard.StandardAnalyzer; > > import org.apache.lucene.document.Document; > > import org.apache.lucene.index.FilterIndexReader; > > import org.apache.lucene.index.IndexReader; > > import org.apache.lucene.index.Term; > > import org.apache.lucene.queryParser.QueryParser; > > import org.apache.lucene.search.HitCollector; > > import org.apache.lucene.search.Hits; > > import org.apache.lucene.search.IndexSearcher; > > import org.apache.lucene.search.Query; > > import org.apache.lucene.search.PhraseQuery; > > import org.apache.lucene.search.ScoreDoc; > > import org.apache.lucene.search.Searcher; > > import org.apache.lucene.search.TopDocCollector; > > import org.apache.lucene.search.highlight.Highlighter; > > import org.apache.lucene.search.highlight.QueryScorer; > > import org.apache.lucene.search.Scorer; > > import org.apache.lucene.analysis.TokenStream; > > import org.apache.lucene.analysis.SimpleAnalyzer; > > > > > > /** Simple command-line based search demo. */ > > public class LuceneSearcher { > > private static final String indexPath = "/opt/lucene/index" + > "/core36"; > > //core36 refers to the exact index directory for tamil pages > > > > private void searchIndex(String terms) throws Exception{ > > String queryString = ""; > > PhraseQuery phrase = new PhraseQuery(); > > String[] termArray = terms.split(" "); > > for (int i=0; i<termArray.length; i++) { > > System.out.println("adding " + termArray[i]); > > //phrase.add(new Term("content", termArray[i])); > > //queryString += termArray[i]; > > } > > / > > //phrase.add(new Term("content", "ubuntu")); > > String tamilQuery = new > > String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"); > > //tamilQuery = new String("ubuntu"); > > phrase.add(new Term("content", tamilQuery)); > > phrase.setSlop(1); > > System.out.println("phrase query " + phrase.toString()); > > > > IndexSearcher searcher = new IndexSearcher(indexPath); > > QueryParser queryParser = null; > > try { > > queryParser = new QueryParser("content", new > SimpleAnalyzer()); > > } catch (Exception ex) { > > ex.printStackTrace(); > > } > > > > //Query query = queryParser.parse(queryString); > > > > Hits hits = null; > > try { > > hits = searcher.search(phrase); > > } catch (Exception ex) { > > ex.printStackTrace(); > > } > > //for highlighter section > > QueryScorer scorer = new QueryScorer(phrase); > > Highlighter highlighter = new Highlighter(scorer); > > > > for (int i = 0; i < hits.length(); i++) { > > String content = hits.doc(i).get("content"); > > TokenStream stream = new > SimpleAnalyzer().tokenStream("content", > > new StringReader(content)); > > String fragment = highlighter.getBestFragments(stream, > content, > > 5, "..."); > > System.out.println(fragment); > > } > > > > > > int hitCount = hits.length(); > > System.out.println("Results found :" + hitCount); > > > > /* > > for (int ix=0; ix<hitCount; ix++) { > > Document doc = hits.doc(ix); > > System.out.println(doc.get("content")); > > } > > */ > > } > > > > public static void main(String args[]) throws Exception{ > > LuceneSearcher searcher = new LuceneSearcher(); > > String termString = args[0]; > > System.out.println("searching for " + args[0]); > > searcher.searchIndex(termString); > > } > > > > } > > ----------------------code ends here--------------------------------- > > NB: Please ignore basic coding conventio[ indentations, comments etc]. > You > > might find some unneccesary code intermixed with the highlighting code, > > ignore them . > > > > Now when I searched for some english docs I got the results with <b></b> > > tags sorrounding the hits like this, > > > > <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home > > <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security > notices > > that affect the current supported releases of <B>Ubuntu</B>. These > notices > > are also posted > > > > Now I thought of testing the same for temil texts. Before this I would > like > > to add one more information that prior to adding the codes for > highlighting > > I was able to search a lucene index from the command line using the raw > > unicode texts like this, > > [...@kk-laptop]$ java LuceneSearcher > "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0" > > > > and it gives me the page that mathces the above query. Now I tried to do > > the > > same alongwith highliting. So in the code I posted above you can see that > I > > commented out the english terms and added one tamil unicode query and > tried > > to see If it gives me the same result that I was getting prior to > > highlighting and found that I'm not getting any results. This might be > > because the query I'm forming using these unicode texts is wrong, or may > be > > something else. I'm not able to figure out what exactly is going wrong? > > Some > > silly mistake I guess, still I'm not able to find out. Can some one take > > the > > pain to go throgh the above code and find out whats wrong. Thank you very > > much. > > > > Thanks, > > KK. > > > > > > -- > Robert Muir > rcm...@gmail.com >