Hello!

I have really long document field values. Tokens of these fields are of the
form word|payload|position_increment. (I need to control position increments
and payloads manually.)
I collect these compound tokens for the entire document, join them with a
'\t', and then pass the resulting string to my custom analyzer.
(For the really long field strings, something breaks in
UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
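For illustration, assembling such a compound field string could look roughly like this (a minimal sketch; the helper name buildField and the sample payload values are mine, not from my actual code):

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundTokens {
    // Join (word, payload, positionIncrement) triples into one field string:
    // word|payload|increment tokens separated by tabs.
    public static String buildField(List<String[]> tokens) {
        List<String> parts = new ArrayList<>();
        for (String[] t : tokens) {
            parts.add(t[0] + "|" + t[1] + "|" + t[2]);
        }
        return String.join("\t", parts);
    }

    public static void main(String[] args) {
        List<String[]> tokens = List.of(
            new String[]{"quick", "7", "1"},
            new String[]{"fast", "7", "0"}  // synonym at the same position
        );
        // Prints the two compound tokens joined by a tab
        System.out.println(buildField(tokens));
    }
}
```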

The analyzer is just the following:

class AmbiguousTokenAnalyzer extends Analyzer {
    private PayloadEncoder encoder = new IntegerEncoder();

    @Override
    protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
        Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
        TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
        sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
        sink.addAttribute(OffsetAttribute.class);
        sink.addAttribute(CharTermAttribute.class);
        sink.addAttribute(PayloadAttribute.class);
        sink.addAttribute(PositionIncrementAttribute.class);
        return new TokenStreamComponents(source, sink);
    }
}

CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter each
have an incrementToken() method in which the rightmost "|aaa" part of a token
is processed and stripped off.
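The splitting step those filters perform is roughly the following (a sketch of the parsing logic only; splitRightmost is a hypothetical helper, and a real TokenFilter would apply it to the CharTermAttribute buffer inside incrementToken()):

```java
public class TokenSplit {
    // Split off the rightmost "|suffix" part of a compound token.
    // Returns {remainder, suffix}; suffix is null if no delimiter is found.
    public static String[] splitRightmost(String token, char delim) {
        int i = token.lastIndexOf(delim);
        if (i < 0) {
            return new String[]{token, null};
        }
        return new String[]{token.substring(0, i), token.substring(i + 1)};
    }

    public static void main(String[] args) {
        // "word|7|1": each filter strips one trailing suffix in turn
        String[] first = splitRightmost("word|7|1", '|');   // {"word|7", "1"}
        String[] second = splitRightmost(first[0], '|');    // {"word", "7"}
        System.out.println(second[0] + " payload=" + second[1] + " inc=" + first[1]);
    }
}
```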

The field is configured as:
        attributeFieldType.setIndexed(true);
        attributeFieldType.setStored(true);
        attributeFieldType.setOmitNorms(true);
        attributeFieldType.setTokenized(true);
        attributeFieldType.setStoreTermVectorOffsets(true);
        attributeFieldType.setStoreTermVectorPositions(true);
        attributeFieldType.setStoreTermVectors(true);
        attributeFieldType.setStoreTermVectorPayloads(true);

The problem is: if I pass the field to the analyzer as one huge string (via
document.add(...)), everything works fine, but if I pass it token by token,
something breaks at the search stage.
As far as I've read, these two approaches should produce the same resulting
index. Maybe my analyzer is missing something?

-- 
Best Regards,
Igor Shalyminov

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
