TokenStreamComponents in Lucene 4.0

Carsten Schnober Mon, 19 Nov 2012 08:45:18 -0800

Hi,
I have recently updated to Lucene 4.0, but I am having problems with my
custom Analyzer/Tokenizer.


With Lucene 3.6, indexing worked like this:

0. define constants lucene_version and indexdir
1. create an Analyzer: analyzer = new KoraAnalyzer() (our custom Analyzer)
2. create an IndexWriterConfig: config = new
IndexWriterConfig(lucene_version, analyzer)
3. create an IndexWriter: writer = new IndexWriter(indexdir, config)
4. for each document:
4.1. create a Document: Document doc = new Document()
4.2. create a Field: Field field = new Field("text", layerFile,
Field.Store.YES, Field.Index.ANALYZED_NO_NORMS,
Field.TermVector.WITH_POSITIONS_OFFSETS);
4.3. add field to document: doc.add(field)
4.4. add document to writer: writer.addDocument(doc)
5. close the writer (write to disk)

However, after switching to Lucene 4 and TokenStreamComponents, I am
seeing strange behaviour: only the first document in the collection
is tokenized properly. The others do appear in the index, but
un-tokenized, although I have tried not to change anything in the logic.
The Analyzer now has this createComponents() method calling the custom
TokenStreamComponents class with my custom Tokenizer:

@Override
protected TokenStreamComponents createComponents(String fieldName,
    Reader reader) {
  final Tokenizer source = new KoraTokenizer(reader);
  final TokenStreamComponents tokenstream =
      new KoraTokenStreamComponents(source);
  try {
    source.close();
  } catch (IOException e) {
    jlog.error(e.getLocalizedMessage());
    e.printStackTrace();
  }
  return tokenstream;
}


The custom TokenStreamComponents class uses this constructor:

public KoraTokenStreamComponents(Tokenizer tokenizer) {
  super(tokenizer);
  try {
    tokenizer.reset();
  } catch (IOException e) {
    // TODO Auto-generated catch block
    e.printStackTrace();
  }
}
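
To make clear what I think the new contract is: as far as I can tell from
the 4.0 Javadoc, the Analyzer calls createComponents() only once per thread
by default and then reuses the returned components for later documents by
handing them a new Reader, while the consumer calls reset() before and
close() after each document. Below is a stdlib-only sketch of that pattern.
ToyTokenizer, ToyAnalyzer and tokenizeAll are hypothetical names made up for
illustration; they are not Lucene classes and not my real code.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Stdlib-only model of per-thread component reuse; none of these classes are Lucene's.
public class ReuseSketch {

    /** Hypothetical stand-in for a Tokenizer: splits its input on whitespace. */
    static class ToyTokenizer {
        private Reader input;
        private String[] tokens = new String[0];
        private int pos = 0;

        ToyTokenizer(Reader input) { this.input = input; }

        /** Point the same tokenizer instance at a new document. */
        void setReader(Reader input) { this.input = input; }

        /** Read the current input and rewind; the consumer calls this before each document. */
        void reset() throws IOException {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = input.read()) != -1) sb.append((char) c);
            String text = sb.toString().trim();
            tokens = text.isEmpty() ? new String[0] : text.split("\\s+");
            pos = 0;
        }

        /** Returns the next token, or null when the document is exhausted. */
        String next() { return pos < tokens.length ? tokens[pos++] : null; }

        /** Release the current input; the consumer calls this after each document. */
        void close() throws IOException { input.close(); }
    }

    /** Hypothetical stand-in for an Analyzer that builds its tokenizer once per thread. */
    static class ToyAnalyzer {
        int created = 0;  // counts how often the "createComponents" step ran
        private final ThreadLocal<ToyTokenizer> cached = new ThreadLocal<>();

        ToyTokenizer tokenStream(Reader reader) {
            ToyTokenizer t = cached.get();
            if (t == null) {
                t = new ToyTokenizer(reader);  // the "createComponents" step: no reset, no close
                created++;
                cached.set(t);
            } else {
                t.setReader(reader);           // later documents reuse the cached instance
            }
            return t;
        }
    }

    /** Plays the role of the indexing loop: reset, consume, close, once per document. */
    static List<List<String>> tokenizeAll(ToyAnalyzer analyzer, List<String> docs)
            throws IOException {
        List<List<String>> out = new ArrayList<>();
        for (String doc : docs) {
            ToyTokenizer t = analyzer.tokenStream(new StringReader(doc));
            t.reset();
            List<String> toks = new ArrayList<>();
            for (String tok = t.next(); tok != null; tok = t.next()) toks.add(tok);
            t.close();
            out.add(toks);
        }
        return out;
    }

    public static void main(String[] args) throws IOException {
        ToyAnalyzer analyzer = new ToyAnalyzer();
        System.out.println(tokenizeAll(analyzer,
                Arrays.asList("first document", "second document")));
        System.out.println("tokenizers created: " + analyzer.created);
    }
}
```

In this model the factory step only constructs the tokenizer; reset() and
close() are left entirely to the consuming loop. Whether that is what my
code is missing is exactly what I am unsure about.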


Since I have not changed anything in the Tokenizer, I suspect the error
is in the new class KoraTokenStreamComponents. This may be because I do
not fully understand why the TokenStreamComponents class has been
introduced.
Any hints on that? Thanks!
Carsten

-- 
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de
Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
