Could you boil this example down to a smaller test case that fails? E.g., make a RAMDirectory, index one document (that should show highlighting), search it, run the highlighter, and show that it's not working?
Mike

On Mon, May 25, 2009 at 10:02 AM, KK <dioxide.softw...@gmail.com> wrote:
> Hi,
> I'm trying to index some non-English texts. Indexing and searching are
> working fine. From the command line I'm able to provide the UTF-8 Unicode
> text as input like this,
> \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> and get the search results.
> Then I tried to add hit highlighting for the same. So I started with simple
> English texts and used phrase queries for providing input queries. My code
> looks like this,
>
>
> import java.io.FileReader;
> import java.io.IOException;
> import java.io.InputStreamReader;
> import java.util.Date;
> import java.io.*;
> import java.nio.charset.Charset;
>
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.standard.StandardAnalyzer;
> import org.apache.lucene.document.Document;
> import org.apache.lucene.index.FilterIndexReader;
> import org.apache.lucene.index.IndexReader;
> import org.apache.lucene.index.Term;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.HitCollector;
> import org.apache.lucene.search.Hits;
> import org.apache.lucene.search.IndexSearcher;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.PhraseQuery;
> import org.apache.lucene.search.ScoreDoc;
> import org.apache.lucene.search.Searcher;
> import org.apache.lucene.search.TopDocCollector;
> import org.apache.lucene.search.highlight.Highlighter;
> import org.apache.lucene.search.highlight.QueryScorer;
> import org.apache.lucene.search.Scorer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.SimpleAnalyzer;
>
>
> /** Simple command-line based search demo.
>  */
> public class LuceneSearcher {
>     private static final String indexPath = "/opt/lucene/index" + "/core36";
>     //core36 refers to the exact index directory for tamil pages
>
>     private void searchIndex(String terms) throws Exception{
>         String queryString = "";
>         PhraseQuery phrase = new PhraseQuery();
>         String[] termArray = terms.split(" ");
>         for (int i=0; i<termArray.length; i++) {
>             System.out.println("adding " + termArray[i]);
>             //phrase.add(new Term("content", termArray[i]));
>             //queryString += termArray[i];
>         }
>
>         //phrase.add(new Term("content", "ubuntu"));
>         String tamilQuery = new String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0");
>         //tamilQuery = new String("ubuntu");
>         phrase.add(new Term("content", tamilQuery));
>         phrase.setSlop(1);
>         System.out.println("phrase query " + phrase.toString());
>
>         IndexSearcher searcher = new IndexSearcher(indexPath);
>         QueryParser queryParser = null;
>         try {
>             queryParser = new QueryParser("content", new SimpleAnalyzer());
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>
>         //Query query = queryParser.parse(queryString);
>
>         Hits hits = null;
>         try {
>             hits = searcher.search(phrase);
>         } catch (Exception ex) {
>             ex.printStackTrace();
>         }
>         //for highlighter section
>         QueryScorer scorer = new QueryScorer(phrase);
>         Highlighter highlighter = new Highlighter(scorer);
>
>         for (int i = 0; i < hits.length(); i++) {
>             String content = hits.doc(i).get("content");
>             TokenStream stream = new SimpleAnalyzer().tokenStream("content",
>                     new StringReader(content));
>             String fragment = highlighter.getBestFragments(stream, content,
>                     5, "...");
>             System.out.println(fragment);
>         }
>
>
>         int hitCount = hits.length();
>         System.out.println("Results found :" + hitCount);
>
>         /*
>         for (int ix=0; ix<hitCount; ix++) {
>             Document doc = hits.doc(ix);
>             System.out.println(doc.get("content"));
>         }
>         */
>     }
>
>     public static void main(String args[]) throws Exception{
>         LuceneSearcher searcher = new LuceneSearcher();
>         String termString = args[0];
>         System.out.println("searching for " + args[0]);
>         searcher.searchIndex(termString);
>     }
>
> }
> ----------------------code ends here---------------------------------
> NB: Please ignore basic coding conventions [indentation, comments, etc.].
> You might find some unnecessary code intermixed with the highlighting code;
> please ignore it.
>
> Now when I searched some English docs I got the results with <b></b>
> tags surrounding the hits like this,
>
> <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
> <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security notices
> that affect the current supported releases of <B>Ubuntu</B>. These notices
> are also posted
>
> Next I wanted to test the same for Tamil texts. Before that, one more piece
> of information: prior to adding the highlighting code, I was able to search
> a Lucene index from the command line using the raw Unicode text like this,
> [...@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"
>
> and it gave me the page that matches the above query. Then I tried to do
> the same along with highlighting. In the code I posted above you can see
> that I commented out the English terms and added one Tamil Unicode query, to
> see if it gives me the same result that I was getting prior to highlighting,
> and found that I'm not getting any results. This might be because the query
> I'm forming from this Unicode text is wrong, or maybe something else. I'm
> not able to figure out what exactly is going wrong. Some silly mistake I
> guess, but I still can't find it. Can someone take the pain to go through
> the above code and find out what's wrong? Thank you very much.
>
> Thanks,
> KK.
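
One detail that may be worth ruling out when comparing the command-line run with the hard-coded query: Java translates \uXXXX escapes only when it compiles source code. A string literal in the program therefore holds the Tamil characters themselves, while an argument typed as "\u0BAA..." on a typical shell arrives in args[0] as literal backslash-u text. The sketch below uses only the standard library to turn such an argument into the characters it names. The class name UnicodeEscapeDecoder and its decode method are hypothetical names made up for this illustration; they are not part of Lucene or of the posted code, and it is an assumption that the argument reaches the program undecoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or the posted code.
public class UnicodeEscapeDecoder {

    // Matches a literal backslash, the letter u, and exactly four hex digits.
    private static final Pattern ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    // Replaces every escape sequence in the input with the single character it names.
    // Text that contains no escape sequences is returned unchanged.
    static String decode(String input) {
        Matcher m = ESCAPE.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            char c = (char) Integer.parseInt(m.group(1), 16);
            // quoteReplacement keeps the decoded character from being read as a group reference.
            m.appendReplacement(out, Matcher.quoteReplacement(String.valueOf(c)));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Falls back to the escaped query from the post when no argument is given.
        String raw = args.length > 0 ? args[0] : "\\u0BAA\\u0BBF\\u0BB0\\u0B9A\\u0BC1\\u0BB0";
        String decoded = decode(raw);
        // With the fallback value this prints: 36 chars in, 6 chars out
        System.out.println(raw.length() + " chars in, " + decoded.length() + " chars out");
    }
}
```

Printing the lengths instead of the text avoids depending on the console encoding. If the count does not shrink, the argument contained no escapes and was presumably already real Unicode text.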