Hi again! Here is my problem in more detail: in addition to indexing, I need the multi-value field to be stored as-is. But if I pass it to the analyzer as multiple atomic tokens, only the first of them gets stored. What do I need to change in my custom analyzer so that it eventually stores all the atomic tokens, concatenated?
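To restate the setup from my earlier message quoted below: each atomic token has the form word|payload|position_increment, and I join all of them with '\t' before handing the string to the analyzer. A minimal stdlib-only sketch of that assembly step (the words and payload values here are made up for illustration):

```java
import java.util.ArrayList;
import java.util.List;

public class CompoundTokens {
    // Assemble one "word|payload|position_increment" compound token.
    static String compound(String word, int payload, int posInc) {
        return word + "|" + payload + "|" + posInc;
    }

    public static void main(String[] args) {
        List<String> tokens = new ArrayList<>();
        tokens.add(compound("foo", 7, 1));
        tokens.add(compound("bar", 3, 0)); // posInc 0: same position as "foo"

        // Join the compound tokens for the whole document with '\t';
        // this single string is what the custom analyzer receives.
        String fieldValue = String.join("\t", tokens);
        System.out.println(fieldValue); // foo|7|1	bar|3|0
    }
}
```

The question is what the analyzer must do so that this whole joined string (not just the first token) ends up in the stored field.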
--
Igor

27.09.2013, 18:12, "Igor Shalyminov" <ishalymi...@yandex-team.ru>:
> Hello!
>
> I have really long document field values. Tokens of these fields are of the
> form: word|payload|position_increment. (I need to control position
> increments and payloads manually.)
> I collect these compound tokens for the entire document, then join them
> with a '\t', and then pass this string to my custom analyzer.
> (For the really long field strings, something breaks in
> UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
>
> The analyzer is just the following:
>
> class AmbiguousTokenAnalyzer extends Analyzer {
>     private PayloadEncoder encoder = new IntegerEncoder();
>
>     @Override
>     protected TokenStreamComponents createComponents(String fieldName,
>                                                      Reader reader) {
>         Tokenizer source = new DelimiterTokenizer('\t',
>                 EngineInfo.ENGINE_VERSION, reader);
>         TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
>         sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
>         sink.addAttribute(OffsetAttribute.class);
>         sink.addAttribute(CharTermAttribute.class);
>         sink.addAttribute(PayloadAttribute.class);
>         sink.addAttribute(PositionIncrementAttribute.class);
>         return new TokenStreamComponents(source, sink);
>     }
> }
>
> CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter have
> an 'incrementToken' method where the rightmost "|aaa" part of a token is
> processed.
>
> The field is configured as:
>
> attributeFieldType.setIndexed(true);
> attributeFieldType.setStored(true);
> attributeFieldType.setOmitNorms(true);
> attributeFieldType.setTokenized(true);
> attributeFieldType.setStoreTermVectorOffsets(true);
> attributeFieldType.setStoreTermVectorPositions(true);
> attributeFieldType.setStoreTermVectors(true);
> attributeFieldType.setStoreTermVectorPayloads(true);
>
> The problem is: if I pass the field to the analyzer as one huge string (via
> document.add(...)), it works OK, but if I pass it token after token,
> something breaks at the search stage.
> As I read somewhere, these two ways should produce the same resulting
> index. Maybe my analyzer misses something?
>
> --
> Best Regards,
> Igor Shalyminov
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org