A caveat to the below: I am very new to Lucene. (That said, following the strategy below, after a couple of days' work I have a set of per-field analyzers for various languages, using various custom filters and caching of the initial analysis, capable of outputting stemmed, reversed, and diacritic/accent-less content -- which is a lot more than I expected when I started out. Hat tip to all the developers of Lucene!)
I found this tutorial, http://www.java2s.com/Open-Source/Java-Document/Search-Engine/lucene/org.apache.lucene.analysis.htm, together with reading the code of the current implementations (the analysis package and the contrib analyzers), a good way to get up and running fairly quickly:

[1] http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/
[2] http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/src/java/org/apache/lucene/analysis/standard/
[3] http://svn.apache.org/viewvc/lucene/dev/branches/branch_3x/lucene/contrib/analyzers/common/src/java/org/apache/lucene/analysis/

The code and documentation for StandardAnalyzer [2] (its createComponents override) and its class hierarchy -- StopwordAnalyzerBase [1] extends ReusableAnalyzerBase [1] extends Analyzer [1] -- are where I started.

Generally, creating your own analyzer is a matter of overriding the tokenStream and reusableTokenStream methods, either directly or, if ReusableAnalyzerBase is in the hierarchy (it usually is if you are using any of the language analyzers), indirectly by overriding the createComponents method. The idiom is then usually:

  Tokenizer src = new SomeTokenizer(..., reader, ...);
  TokenStream tokenStream = new SomeFilter(..., src, ...);
  tokenStream = new AnotherFilter(..., tokenStream, ...);
  ...
  tokenStream = new YetAnotherFilter(..., tokenStream, ...);
  // if overriding createComponents:
  return new TokenStreamComponents(src, tokenStream);
  // else:
  // return tokenStream;

(A complete, minimal version of this is sketched in the P.S. below.)

For filters and attribute use (see the tutorial link) I found LowerCaseFilter [1] (for use of CharTermAttribute), FilteringTokenFilter [1] (for use of PositionIncrementAttribute), and SynonymFilter [3 (synonyms/)] helpful; a toy CharTermAttribute filter is also sketched in the P.S.

Other classes I have found useful to know about at this stage:

- PerFieldAnalyzerWrapper (derives from Analyzer, so overrides tokenStream & reusableTokenStream): useful for applying different analysis to individual fields of a single document (sketch in the P.S.).
- CachingTokenFilter & TeeSinkTokenFilter: useful for avoiding duplication of (expensive) analysis where fields share a common initial analysis (sketch in the P.S.).

So far the available tokenizers have met my needs, so I have not yet looked at implementing my own; the code base is probably as good a place as any to start for that too, after which I would guess the parsing of the input stream becomes the complicated bit. Maybe someone else can chip in on that?

Hope this helps,
kind regards,
graham

On Mon, Aug 22, 2011 at 8:10 AM, Saar Carmi <saarca...@gmail.com> wrote:
> Hi
> Where can I find a guide for building analyzers, filters and tokenizers?
>
> Saar
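P.S. A few sketches to make the above concrete, all against the 3.x API (I happen to use Version.LUCENE_33 -- adjust to your release). First, the createComponents idiom written out as a complete, minimal analyzer; the class name and this particular filter chain are just my choices (it roughly mirrors what StandardAnalyzer itself does), not anything prescribed:

  import java.io.Reader;

  import org.apache.lucene.analysis.LowerCaseFilter;
  import org.apache.lucene.analysis.ReusableAnalyzerBase;
  import org.apache.lucene.analysis.StopAnalyzer;
  import org.apache.lucene.analysis.StopFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.standard.StandardFilter;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.util.Version;

  public final class SimpleCustomAnalyzer extends ReusableAnalyzerBase {
    private static final Version MATCH_VERSION = Version.LUCENE_33;

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
      // the source tokenizer
      Tokenizer src = new StandardTokenizer(MATCH_VERSION, reader);
      // each filter wraps the stream produced by the previous one
      TokenStream tok = new StandardFilter(MATCH_VERSION, src);
      tok = new LowerCaseFilter(MATCH_VERSION, tok);
      tok = new StopFilter(MATCH_VERSION, tok, StopAnalyzer.ENGLISH_STOP_WORDS_SET);
      // hand back both ends of the chain so the base class can reuse them
      return new TokenStreamComponents(src, tok);
    }
  }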
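Second, a toy TokenFilter showing CharTermAttribute use: it reverses each term's characters in place (roughly what contrib's ReverseStringFilter does, and how I get the "reversed" content I mentioned). The class name is mine:

  import java.io.IOException;

  import org.apache.lucene.analysis.TokenFilter;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

  public final class ReversingFilter extends TokenFilter {
    // register interest in the term text; the attribute is shared with the rest of the chain
    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

    public ReversingFilter(TokenStream input) {
      super(input);
    }

    @Override
    public boolean incrementToken() throws IOException {
      if (!input.incrementToken()) {
        return false; // no more tokens from upstream
      }
      // reverse the term's characters in the attribute's buffer, in place
      final char[] buf = termAtt.buffer();
      final int len = termAtt.length();
      for (int i = 0; i < len / 2; i++) {
        final char tmp = buf[i];
        buf[i] = buf[len - 1 - i];
        buf[len - 1 - i] = tmp;
      }
      return true;
    }
  }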
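Third, PerFieldAnalyzerWrapper in use; the "sku" field name is made up for the example:

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.KeywordAnalyzer;
  import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
  import org.apache.lucene.analysis.standard.StandardAnalyzer;
  import org.apache.lucene.util.Version;

  public class PerFieldExample {
    public static Analyzer buildAnalyzer() {
      // StandardAnalyzer for every field, unless overridden below
      PerFieldAnalyzerWrapper wrapper =
          new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_33));
      // index the hypothetical "sku" field as a single untokenized term
      wrapper.addAnalyzer("sku", new KeywordAnalyzer());
      return wrapper; // hand to IndexWriter/QueryParser like any other Analyzer
    }
  }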
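And last, TeeSinkTokenFilter sharing one (notionally expensive) tokenization between two fields. The field names and the stemmed second chain are again mine; note the javadoc's warning that the tee must be consumed before its sinks (tee.consumeAllTokens() guarantees this if the ordering is in doubt):

  import java.io.StringReader;

  import org.apache.lucene.analysis.PorterStemFilter;
  import org.apache.lucene.analysis.TeeSinkTokenFilter;
  import org.apache.lucene.analysis.TeeSinkTokenFilter.SinkTokenStream;
  import org.apache.lucene.analysis.standard.StandardTokenizer;
  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;
  import org.apache.lucene.util.Version;

  public class TeeSinkExample {
    public static Document build(String text) {
      // tokenize the text once...
      TeeSinkTokenFilter tee = new TeeSinkTokenFilter(
          new StandardTokenizer(Version.LUCENE_33, new StringReader(text)));
      // ...and set up a sink that will replay the same tokens
      SinkTokenStream sink = tee.newSinkTokenStream();

      Document doc = new Document();
      doc.add(new Field("plain", tee));                          // consumes the tokenizer
      doc.add(new Field("stemmed", new PorterStemFilter(sink))); // reuses the cached tokens
      return doc;
    }
  }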