Hi,

I have recently updated to Lucene 4.0, but I am having problems with my custom Analyzer/Tokenizer.

In the days of Lucene 3.6, it worked like this:

0. Define the constants lucene_version and indexdir.
1. Create an Analyzer (our custom one): analyzer = new KoraAnalyzer()
2. Create an IndexWriterConfig: config = new IndexWriterConfig(lucene_version, analyzer)
3. Create an IndexWriter: writer = new IndexWriter(indexdir, config)
4. For each document:
   4.1. Create a Document: Document doc = new Document()
   4.2. Create a Field: Field field = new Field("text", layerFile, Field.Store.YES, Field.Index.ANALYZED_NO_NORMS, Field.TermVector.WITH_POSITIONS_OFFSETS);
   4.3. Add the field to the document: doc.add(field)
   4.4. Add the document to the writer: writer.addDocument(doc)
5. Close the writer (write to disk).

However, after switching to Lucene 4 and TokenStreamComponents, I'm getting strange behaviour: only the first document in the collection is tokenized properly. The others do appear in the index, but un-tokenized, although I have tried not to change anything in the logic.

The Analyzer now has this createComponents() method, which passes my custom Tokenizer to the custom TokenStreamComponents class:

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        final Tokenizer source = new KoraTokenizer(reader);
        final TokenStreamComponents tokenstream = new KoraTokenStreamComponents(source);
        try {
            source.close();
        } catch (IOException e) {
            jlog.error(e.getLocalizedMessage());
            e.printStackTrace();
        }
        return tokenstream;
    }

The custom TokenStreamComponents class uses this constructor:

    public KoraTokenStreamComponents(Tokenizer tokenizer) {
        super(tokenizer);
        try {
            tokenizer.reset();
        } catch (IOException e) {
            // TODO Auto-generated catch block
            e.printStackTrace();
        }
    }

Since I have not changed anything in the Tokenizer, I suspect the error is in the new class KoraTokenStreamComponents. This may be because I do not fully understand why the TokenStreamComponents class was introduced. Any hints on that?

Thanks!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
Next Generation Corpus Analysis Platform
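
One possible explanation for the symptom described above, offered as an assumption rather than a diagnosis: in Lucene 4 the Analyzer caches the TokenStreamComponents it gets from createComponents() and reuses them for later documents, handing the existing tokenizer a new Reader instead of constructing a new one. The consumer then calls reset() before reading tokens and close() afterwards, so a tokenizer that keeps per-document state and does not clear it in reset() may go silent after the first document. The sketch below is a toy model of that pattern using only the Java standard library. ToyTokenizer, setReader(), nextToken() and tokenCounts() are hypothetical names for illustration, not Lucene's API, and nothing here is taken from KoraTokenizer.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Toy model only: these classes imitate a reuse pattern, they are NOT Lucene classes.
public class ReuseDemo {

    /** Hypothetical whitespace tokenizer that is reused across documents. */
    static class ToyTokenizer {
        private Reader input;
        private boolean exhausted;               // per-document state
        private final boolean clearStateOnReset; // toggles the buggy vs. fixed variant

        ToyTokenizer(boolean clearStateOnReset) {
            this.clearStateOnReset = clearStateOnReset;
        }

        /** Called by the owner before each document to swap in new input. */
        void setReader(Reader reader) {
            this.input = reader;
        }

        /** Called by the consumer before the first nextToken() of each document. */
        void reset() {
            if (clearStateOnReset) {
                exhausted = false;
            }
        }

        /** Returns the next whitespace-separated token, or null at end of input. */
        String nextToken() throws IOException {
            if (exhausted) {
                return null;
            }
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) {
                if (Character.isWhitespace(c)) {
                    if (sb.length() > 0) {
                        return sb.toString();
                    }
                } else {
                    sb.append((char) c);
                }
            }
            exhausted = true;
            return sb.length() > 0 ? sb.toString() : null;
        }
    }

    /** Counts tokens per document while reusing a single tokenizer instance. */
    static List<Integer> tokenCounts(ToyTokenizer tokenizer, String... docs) throws IOException {
        List<Integer> counts = new ArrayList<>();
        for (String doc : docs) {
            tokenizer.setReader(new StringReader(doc)); // new input, same tokenizer
            tokenizer.reset();                          // consumer resets before consuming
            int n = 0;
            while (tokenizer.nextToken() != null) {
                n++;
            }
            counts.add(n);
        }
        return counts;
    }

    public static void main(String[] args) throws IOException {
        // Stale state survives reuse: the second document yields no tokens.
        System.out.println(tokenCounts(new ToyTokenizer(false), "a b c", "d e")); // [3, 0]
        // State cleared in reset(): every document is tokenized.
        System.out.println(tokenCounts(new ToyTokenizer(true), "a b c", "d e"));  // [3, 2]
    }
}
```

If this assumption holds for the real tokenizer, the place to look would be the per-document fields of KoraTokenizer and its reset() method, and the reset() and close() calls would be left to the consumer rather than made inside createComponents() or the components' constructor.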