It's fairly easy to construct your own analyzer by stringing together some filters and tokenizers. LIA (1st ed) had a SynonymAnalyzer. You probably want something like the following (WARNING, example only, I'm not even sure it compiles!! Ripped off from the wiki):

import java.io.Reader;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.WhitespaceTokenizer;

public class MyAnalyzer extends Analyzer {
    public TokenStream tokenStream(String field, final Reader reader) {
        // Split on whitespace only, then lowercase each token.
        return new LowerCaseFilter(new WhitespaceTokenizer(reader));
    }
}

There are a number of Filters you can string together if you want to, say,
remove stop words etc.

HTH
Erick

On Tue, May 26, 2009 at 6:38 AM, KK <dioxide.softw...@gmail.com> wrote:

> Thank you @Muir.
> I was earlier using SimpleAnalyzer for all purposes, but as you recommended
> the whitespace one, I tried that analyzer, and the good thing is that
> I'm able to index/search non-English text as well as support hit
> highlighting for these non-English texts. Thank you very much.
> But now there is one silly problem. As WhitespaceAnalyzer does not do
> anything other than separate the tokens on spaces, case-folding is
> missing for English pages. Unless I provide the exact words, including
> the right case, it does not give me results, which is quite obvious.
> As I went through the LIA 2nd Edn book, I found that it mentions we can
> use analyzers at the document level and also at the field level. I was
> quite amazed at the granularity of analysis supported by Lucene. But it's
> there, we just have to make use of it. So I'm thinking of giving it a try;
> that will help me support both English and non-English
> indexing/searching/highlighting. Thank you all. Any ideas on the same are
> always welcome.
>
> Thanks,
> KK.
>
>
> On Tue, May 26, 2009 at 1:24 AM, Robert Muir <rcm...@gmail.com> wrote:
>
> > as mentioned previously, i dont think your text is being analyzed the way
> > you want.
> >
> > SimpleAnalyzer will break your word \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > (பரிணாம) into 3 tokens:
> >
> > \u0BAA\u0BB0
> > \u0BA3
> > \u0BAE
> >
> > Not only does it incorrectly split your word into three words, but it
> > completely drops the dependent vowels (\u0BBF and \u0BBE).
> >
> > This is why i would recommend trying whitespace analyzer instead.
> > Also take a look at the Luke index tool, it's a very quick way to see how
> > your words are being analyzed by various analyzers.
> >
> >
> > On Mon, May 25, 2009 at 10:02 AM, KK <dioxide.softw...@gmail.com> wrote:
> >
> > > Hi,
> > > I'm trying to index some non-English texts. Indexing and searching are
> > > working fine. From the command line I'm able to provide the UTF-8
> > > unicode text as input like this,
> > > \u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE
> > > and able to get the search results.
> > > Then I tried to add hit highlighting for the same. So I started with
> > > simple English texts and used phrase queries for providing input
> > > queries. My code looks like this,
> > >
> > >
> > > import java.io.FileReader;
> > > import java.io.IOException;
> > > import java.io.InputStreamReader;
> > > import java.util.Date;
> > > import java.io.*;
> > > import java.nio.charset.Charset;
> > >
> > > import org.apache.lucene.analysis.Analyzer;
> > > import org.apache.lucene.analysis.standard.StandardAnalyzer;
> > > import org.apache.lucene.document.Document;
> > > import org.apache.lucene.index.FilterIndexReader;
> > > import org.apache.lucene.index.IndexReader;
> > > import org.apache.lucene.index.Term;
> > > import org.apache.lucene.queryParser.QueryParser;
> > > import org.apache.lucene.search.HitCollector;
> > > import org.apache.lucene.search.Hits;
> > > import org.apache.lucene.search.IndexSearcher;
> > > import org.apache.lucene.search.Query;
> > > import org.apache.lucene.search.PhraseQuery;
> > > import org.apache.lucene.search.ScoreDoc;
> > > import org.apache.lucene.search.Searcher;
> > > import org.apache.lucene.search.TopDocCollector;
> > > import org.apache.lucene.search.highlight.Highlighter;
> > > import org.apache.lucene.search.highlight.QueryScorer;
> > > import org.apache.lucene.search.Scorer;
> > > import org.apache.lucene.analysis.TokenStream;
> > > import org.apache.lucene.analysis.SimpleAnalyzer;
> > >
> > >
> > > /** Simple
> > > command-line based search demo. */
> > > public class LuceneSearcher {
> > >     private static final String indexPath = "/opt/lucene/index" + "/core36";
> > >     //core36 refers to the exact index directory for tamil pages
> > >
> > >     private void searchIndex(String terms) throws Exception{
> > >         String queryString = "";
> > >         PhraseQuery phrase = new PhraseQuery();
> > >         String[] termArray = terms.split(" ");
> > >         for (int i=0; i<termArray.length; i++) {
> > >             System.out.println("adding " + termArray[i]);
> > >             //phrase.add(new Term("content", termArray[i]));
> > >             //queryString += termArray[i];
> > >         }
> > >
> > >         //phrase.add(new Term("content", "ubuntu"));
> > >         String tamilQuery = new String("\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0");
> > >         //tamilQuery = new String("ubuntu");
> > >         phrase.add(new Term("content", tamilQuery));
> > >         phrase.setSlop(1);
> > >         System.out.println("phrase query " + phrase.toString());
> > >
> > >         IndexSearcher searcher = new IndexSearcher(indexPath);
> > >         QueryParser queryParser = null;
> > >         try {
> > >             queryParser = new QueryParser("content", new SimpleAnalyzer());
> > >         } catch (Exception ex) {
> > >             ex.printStackTrace();
> > >         }
> > >
> > >         //Query query = queryParser.parse(queryString);
> > >
> > >         Hits hits = null;
> > >         try {
> > >             hits = searcher.search(phrase);
> > >         } catch (Exception ex) {
> > >             ex.printStackTrace();
> > >         }
> > >         //for highlighter section
> > >         QueryScorer scorer = new QueryScorer(phrase);
> > >         Highlighter highlighter = new Highlighter(scorer);
> > >
> > >         for (int i = 0; i < hits.length(); i++) {
> > >             String content = hits.doc(i).get("content");
> > >             TokenStream stream = new SimpleAnalyzer().tokenStream("content",
> > >                     new StringReader(content));
> > >             String fragment = highlighter.getBestFragments(stream, content,
> > >                     5, "...");
> > >             System.out.println(fragment);
> > >         }
> > >
> > >
> > >         int hitCount = hits.length();
> > >         System.out.println("Results found :" + hitCount);
> > >
> > >         /*
> > >         for (int ix=0; ix<hitCount; ix++) {
> > >             Document doc = hits.doc(ix);
> > >             System.out.println(doc.get("content"));
> > >         }
> > >         */
> > >     }
> > >
> > >     public static void main(String args[]) throws Exception{
> > >         LuceneSearcher searcher = new LuceneSearcher();
> > >         String termString = args[0];
> > >         System.out.println("searching for " + args[0]);
> > >         searcher.searchIndex(termString);
> > >     }
> > >
> > > }
> > > ----------------------code ends here---------------------------------
> > > NB: Please ignore basic coding conventions [indentation, comments etc].
> > > You might find some unnecessary code intermixed with the highlighting
> > > code, ignore it.
> > >
> > > Now when I searched for some English docs I got the results with <b></b>
> > > tags surrounding the hits like this,
> > >
> > > <B>Ubuntu</B> Press Releases Media Contact <B>Ubuntu</B> News Home
> > > <B>Ubuntu</B> Security NoticesThese are the <B>Ubuntu</B> security notices
> > > that affect the current supported releases of <B>Ubuntu</B>. These notices
> > > are also posted
> > >
> > > Now I thought of testing the same for Tamil texts. Before this I would
> > > like to add one more piece of information: prior to adding the code for
> > > highlighting I was able to search a Lucene index from the command line
> > > using the raw unicode text like this,
> > > [...@kk-laptop]$ java LuceneSearcher "\u0BAA\u0BBF\u0BB0\u0B9A\u0BC1\u0BB0"
> > >
> > > and it gives me the page that matches the above query. Now I tried to do
> > > the same along with highlighting. So in the code I posted above you can
> > > see that I commented out the English terms and added one Tamil unicode
> > > query and tried to see if it gives me the same result that I was getting
> > > prior to highlighting, and found that I'm not getting any results. This
> > > might be because the query I'm forming using these unicode texts is
> > > wrong, or may be something else.
> > > I'm not able to figure out what exactly is going wrong. Some silly
> > > mistake I guess, still I'm not able to find it. Can someone take the
> > > pain to go through the above code and find out what's wrong? Thank you
> > > very much.
> > >
> > > Thanks,
> > > KK.
> > >
> >
> > --
> > Robert Muir
> > rcm...@gmail.com
> >
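
P.S. Robert's three-token breakdown quoted above can be checked without building an index. The rough sketch below uses only the plain JDK, no Lucene classes. It assumes (my understanding of the 2.x code, not something stated in this thread) that SimpleAnalyzer keeps runs of characters for which Character.isLetter is true and lowercases them, and it compares that with a whitespace split plus lowercasing, which is what the MyAnalyzer idea amounts to. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-alone demo; it only imitates the tokenizing rules.
class TokenizerDemo {

    // Assumed SimpleAnalyzer-like rule: a token is a maximal run of chars
    // for which Character.isLetter is true, lowercased.
    static List<String> letterTokens(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isLetter(c)) {
                current.append(Character.toLowerCase(c));
            } else if (current.length() > 0) {
                // A non-letter (e.g. a dependent vowel sign) ends the token
                // and is itself thrown away.
                tokens.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens;
    }

    // Whitespace split followed by lowercasing, as in the MyAnalyzer sketch.
    static List<String> whitespaceLowercaseTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.trim().split("\\s+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String word = "\u0BAA\u0BB0\u0BBF\u0BA3\u0BBE\u0BAE";
        System.out.println(letterTokens(word).size() + " " + letterTokens(word));
        System.out.println(whitespaceLowercaseTokens("Ubuntu " + word));
    }
}
```

With the letter rule the Tamil word comes back as three fragments with the vowel signs gone, matching Robert's list; with the whitespace rule it stays whole and "Ubuntu" is folded to "ubuntu". Luke is still the better tool for checking what a real analyzer puts in the index.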