Actually, term vectors can store payloads now (LUCENE-1888), so if that field was indexed with FieldType.setStoreTermVectorPayloads enabled, the payloads should be there.
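At indexing time that would look roughly like this (just a sketch: "writer" stands in for your IndexWriter, the field name is taken from your example, and the payload bytes themselves still have to be set by your analyzer's PayloadAttribute):

  FieldType ft = new FieldType();
  ft.setIndexed(true);
  ft.setTokenized(true);
  ft.setStoreTermVectors(true);
  ft.setStoreTermVectorPositions(true);   // payloads are per position, so positions must be on
  ft.setStoreTermVectorOffsets(true);
  ft.setStoreTermVectorPayloads(true);    // LUCENE-1888
  ft.freeze();

  Document doc = new Document();
  // the analyzer/tokenizer for this field must set PayloadAttribute on each token
  doc.add(new Field("term", "some text here", ft));
  writer.addDocument(doc);
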
But I suspect the TokenSources.getTokenStream API (which I think un-inverts the term vectors to recreate the token stream, i.e. very slow?) wasn't fixed to also carry the payloads through?

Mike McCandless

http://blog.mikemccandless.com


On Tue, Apr 23, 2013 at 7:10 AM, Uwe Schindler <u...@thetaphi.de> wrote:

> TermVectors are per-document and do not contain payloads. You are reading
> the per-document TermVectors, which are a "small index" *stored* for each
> document as a binary blob. This blob only contains the terms of the
> document with their positions/offsets, but no payloads (offsets are used,
> e.g., for highlighting).
>
> To retrieve payloads, you have to use the main TermsEnum and the main
> posting lists, but this does *not* work per document. In general you would
> execute a query and then retrieve the payload for each hit while iterating
> the scorer (e.g. function queries can do this).
>
> Uwe
>
> -----
> Uwe Schindler
> H.-H.-Meier-Allee 63, D-28213 Bremen
> http://www.thetaphi.de
> eMail: u...@thetaphi.de
>
>
> > -----Original Message-----
> > From: Carsten Schnober [mailto:schno...@ids-mannheim.de]
> > Sent: Tuesday, April 23, 2013 1:04 PM
> > To: java-user
> > Subject: Reading Payloads
> >
> > Hi,
> > I'm trying to extract payloads from an index for specific tokens the
> > following way (inserting a sample document number and term):
> >
> > Terms terms = reader.getTermVector(16504, "term");
> > TokenStream tokenstream = TokenSources.getTokenStream(terms);
> > while (tokenstream.incrementToken()) {
> >   OffsetAttribute offset = tokenstream.getAttribute(OffsetAttribute.class);
> >   int start = offset.startOffset();
> >   int end = offset.endOffset();
> >   String token = tokenstream.getAttribute(CharTermAttribute.class).toString();
> >
> >   PayloadAttribute payloadAttr = tokenstream.addAttribute(PayloadAttribute.class);
> >   BytesRef payloadBytes = payloadAttr.getPayload();
> >
> >   ...
> > }
> >
> > This works fine for the OffsetAttribute and the CharTermAttribute, but
> > payloadAttr.getPayload() always returns null, for all documents and all
> > tokens, unfortunately. However, I know that the payloads are stored in
> > the index, as I can retrieve them through a SpanQuery with
> > Spans.getPayload(). I actually expect every token to carry a payload,
> > as my custom tokenizer implementation has the following lines:
> >
> > public class KoraTokenizer extends Tokenizer {
> >   ...
> >   private PayloadAttribute payloadAttr = addAttribute(PayloadAttribute.class);
> >   ...
> >   public boolean incrementToken() {
> >     ...
> >     payloadAttr.setPayload(new BytesRef(payloadString));
> >     ...
> >   }
> >   ...
> > }
> >
> > I've asserted that the payloadString variable is never an empty String,
> > and, as I said above, I can retrieve the payloads with Spans.getPayload().
> > So what am I doing wrong in my
> > tokenstream.addAttribute(PayloadAttribute.class) call? BTW, I used
> > tokenstream.getAttribute() before, as for the other attributes, but this
> > obviously threw an IllegalArgumentException, so I implemented the
> > recommendation given in the documentation and replaced it with
> > addAttribute().
> >
> > Thanks!
> > Carsten
> >
> >
> > --
> > Institut für Deutsche Sprache | http://www.ids-mannheim.de
> > Projekt KorAP | http://korap.ids-mannheim.de
> > Tel. +49-(0)621-43740789 | schno...@ids-mannheim.de
> > Korpusanalyseplattform der nächsten Generation
> > Next Generation Corpus Analysis Platform
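
For completeness, the postings-based lookup Uwe describes above would look roughly like this on the 4.x API (only a sketch: "someToken" is a placeholder term, "reader" is your IndexReader, the field name is taken from your example, and getPayload() returns null for positions that carry no payload):

  Bits liveDocs = MultiFields.getLiveDocs(reader);
  DocsAndPositionsEnum dpe = MultiFields.getTermPositionsEnum(
      reader, liveDocs, "term", new BytesRef("someToken"));
  if (dpe != null) {                       // null if the term/field is missing or positions weren't indexed
    int doc;
    while ((doc = dpe.nextDoc()) != DocIdSetIterator.NO_MORE_DOCS) {
      for (int i = 0; i < dpe.freq(); i++) {
        int position = dpe.nextPosition();
        BytesRef payload = dpe.getPayload();   // may be null at this position
        // use doc / position / payload here
      }
    }
  }

Note this walks every document containing the term; to get payloads per query hit you would instead pull them while iterating the scorer or the Spans, as Uwe says.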