Hi all! A little more exploration :)
After indexing with multiple atomic field values, here is what I get.

indexSearcher.doc(0).getFields("gramm"):

stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:V|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:PR|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|0|1000 S|1|0>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:SPRO|1|1000>
stored,indexed,tokenized,termVector,omitNorms<gramm:ADV|1|1>
stored,indexed,tokenized,termVector,omitNorms<gramm:A|1|1>

indexSearcher.doc(0).getField("gramm"):

stored,indexed,tokenized,termVector,omitNorms<gramm:S|3|1000>

The values are absolutely correct, but why does getField() return only the first one instead of concatenating them?
If I want to handcraft my custom highlighter, is iterating through (roughly) all the stored field values the right technique? (Previously I was using Analyzer.tokenStream().incrementToken() on the entire concatenated field.)

--
Igor

02.10.2013, 21:26, "Igor Shalyminov" <ishalymi...@yandex-team.ru>:
> Hi again!
>
> Here is my problem in more detail: in addition to indexing, I need the
> multi-value field to be stored as-is. And if I pass it into the analyzer as
> multiple atomic tokens, it stores only the first of them.
> What do I need to do in my custom analyzer to make it store all the atomic
> tokens concatenated in the end?
>
> --
> Igor
>
> 27.09.2013, 18:12, "Igor Shalyminov" <ishalymi...@yandex-team.ru>:
>
>> Hello!
>>
>> I have really long document field values. Tokens of these fields have the
>> form word|payload|position_increment. (I need to control position
>> increments and payloads manually.)
>> I collect these compound tokens for the entire document, join them
>> with '\t', and then pass the resulting string to my custom analyzer.
>> (For the really long field strings, something breaks in
>> UnicodeUtil.UTF16toUTF8() with an ArrayIndexOutOfBoundsException.)
>>
>> The analyzer is just the following:
>>
>> class AmbiguousTokenAnalyzer extends Analyzer {
>>     private PayloadEncoder encoder = new IntegerEncoder();
>>
>>     @Override
>>     protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
>>         Tokenizer source = new DelimiterTokenizer('\t', EngineInfo.ENGINE_VERSION, reader);
>>         TokenStream sink = new DelimitedPositionIncrementFilter(source, '|');
>>         sink = new CustomDelimitedPayloadTokenFilter(sink, '|', encoder);
>>         sink.addAttribute(OffsetAttribute.class);
>>         sink.addAttribute(CharTermAttribute.class);
>>         sink.addAttribute(PayloadAttribute.class);
>>         sink.addAttribute(PositionIncrementAttribute.class);
>>         return new TokenStreamComponents(source, sink);
>>     }
>> }
>>
>> CustomDelimitedPayloadTokenFilter and DelimitedPositionIncrementFilter each have
>> an incrementToken() method in which the rightmost "|aaa" part of a token is
>> processed.
>>
>> The field is configured as:
>>
>> attributeFieldType.setIndexed(true);
>> attributeFieldType.setStored(true);
>> attributeFieldType.setOmitNorms(true);
>> attributeFieldType.setTokenized(true);
>> attributeFieldType.setStoreTermVectorOffsets(true);
>> attributeFieldType.setStoreTermVectorPositions(true);
>> attributeFieldType.setStoreTermVectors(true);
>> attributeFieldType.setStoreTermVectorPayloads(true);
>>
>> The problem is: if I pass the field to the analyzer as a whole (one huge string,
>> via document.add(...)), everything works, but if I pass it token by token,
>> something breaks at the search stage.
>> As I read somewhere, these two ways should produce the same resulting index.
>> Maybe my analyzer is missing something?
>>
>> --
>> Best Regards,
>> Igor Shalyminov
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
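P.S. To illustrate what I mean by iterating the stored values and joining them back: a minimal, self-contained sketch in plain Java. (The helper name is mine; with Lucene, the values would come from iterating doc.getFields("gramm") and calling stringValue() on each returned field instance.)

```java
import java.util.Arrays;
import java.util.List;

public class JoinStoredValues {
    // getField() returns only the first stored instance of a multi-valued
    // field, so to rebuild the full original text we have to iterate all
    // stored instances ourselves and join them back together.
    static String joinValues(List<String> storedValues, String separator) {
        StringBuilder sb = new StringBuilder();
        for (String v : storedValues) {
            if (sb.length() > 0) sb.append(separator);
            sb.append(v);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String> stored = Arrays.asList("S|3|1000", "V|1|1", "PR|1|1");
        // Rejoin with the same '\t' separator used at indexing time.
        System.out.println(joinValues(stored, "\t"));
    }
}
```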
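P.P.S. For reference, a minimal sketch of splitting one compound word|payload|position_increment token from the right, which is the kind of parsing the two custom filters perform in incrementToken(). (The class and parsing details here are hypothetical illustrations, not the actual filter code.)

```java
public class CompoundToken {
    final String term;
    final int payload;
    final int positionIncrement;

    CompoundToken(String term, int payload, int positionIncrement) {
        this.term = term;
        this.payload = payload;
        this.positionIncrement = positionIncrement;
    }

    // Split "word|payload|position_increment" from the right, so a '|'
    // inside the word itself would not break the parsing.
    static CompoundToken parse(String raw) {
        int last = raw.lastIndexOf('|');
        int prev = raw.lastIndexOf('|', last - 1);
        String term = raw.substring(0, prev);
        int payload = Integer.parseInt(raw.substring(prev + 1, last));
        int posInc = Integer.parseInt(raw.substring(last + 1));
        return new CompoundToken(term, payload, posInc);
    }

    public static void main(String[] args) {
        CompoundToken t = CompoundToken.parse("S|3|1000");
        System.out.println(t.term + " payload=" + t.payload
                + " posInc=" + t.positionIncrement);
    }
}
```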