Hello! I have really long document field values. The tokens in these fields have the form: word|payload|position_increment. (I need to control position increments and payloads manually.) I collect these compound tokens for the entire document, join them with a '\t', and pass the resulting string to my custom analyzer. (For really long field strings, something breaks in UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
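For illustration, here is a minimal sketch of how a field string in this format might be assembled. The class name, method names, and sample values are hypothetical and are not part of the actual indexing code; a position increment of 0 is assumed to mean "same position as the previous token".

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundTokens {
    // Separates whole compound tokens inside the field value.
    public static final char TOKEN_DELIMITER = '\t';
    // Separates the word, payload, and position increment inside one token.
    public static final char PART_DELIMITER = '|';

    /** Builds one compound token of the form word|payload|position_increment. */
    public static String encode(String word, int payload, int positionIncrement) {
        return word + PART_DELIMITER + payload + PART_DELIMITER + positionIncrement;
    }

    /** Joins the compound tokens of a whole document into a single field value. */
    public static String join(List<String> tokens) {
        return String.join(String.valueOf(TOKEN_DELIMITER), tokens);
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        tokens.add(encode("bank", 7, 1));
        tokens.add(encode("shore", 3, 0)); // increment 0: stacked on the previous token
        tokens.add(encode("river", 12, 1));
        // Show the tab delimiter explicitly when printing.
        System.out.println(join(tokens).replace("\t", "\\t"));
    }
}
```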
The analyzer is just the following:

    class AmbiguousTokenAnalyzer extends Analyzer {
        private PayloadEncoder encoder = new IntegerEncoder();

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            // Split the field value into compound tokens on '\t'.
            Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
            // Each filter consumes the rightmost "|..." part of a token.
            TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
            sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
            sink.addAttribute(OffsetAttribute.class);
            sink.addAttribute(CharTermAttribute.class);
            sink.addAttribute(PayloadAttribute.class);
            sink.addAttribute(PositionIncrementAttribute.class);
            return new TokenStreamComponents(source, sink);
        }
    }

CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter each have an incrementToken() method in which the rightmost "|aaa" part of a token is processed.

The field is configured as:

    attributeFieldType.setIndexed(true);
    attributeFieldType.setStored(true);
    attributeFieldType.setOmitNorms(true);
    attributeFieldType.setTokenized(true);
    attributeFieldType.setStoreTermVectorOffsets(true);
    attributeFieldType.setStoreTermVectorPositions(true);
    attributeFieldType.setStoreTermVectors(true);
    attributeFieldType.setStoreTermVectorPayloads(true);

The problem is that if I pass the field itself to the analyzer (one huge string, via document.add(...)), it works OK, but if I pass it token after token, something breaks at the search stage. As I read somewhere, these two approaches should be equivalent as far as the resulting index is concerned. Maybe my analyzer is missing something?

--
Best Regards,
Igor Shalyminov