Thanks a lot. I think TermPositionVector will solve my problem, although it seems to be a little inefficient.
Concerning the term representation: our data is far more complex than plain phrasal annotation. That was just an example, because I am not allowed to talk about our internal organisation. I will look into the Payload class; it should help me come up with a solution.

On Fri, Oct 9, 2009 at 7:16 PM, David Causse <dcau...@spotter.com> wrote:
> Hi,
>
> we also index linguistic data, but (someone correct me if I'm wrong) you
> have to work with what the Lucene store offers.
> You can store:
> usable on the search side:
> - a term (TermAttribute)
> - the position of the term (PositionIncrementAttribute)
> - an arbitrary payload (PayloadAttribute)
> usable once you have found results:
> - TermVector (no attribute, or OffsetAttribute and/or
>   PositionIncrementAttribute)
> - any data you stored in a field (arbitrary data)
>
> Offsets (OffsetAttribute) are stored in the TermVector (if you specified
> that you wanted them). You can't search data within the
> TermPositionVector, but you can iterate over your results and ask the
> reader to return the TermPositionVector for a specific document and field.
>
> Lucene can't store arbitrary Attributes; they are only useful in an
> analysis pipeline. If you want to search for this info, you have to
> serialize the data inside the term itself (e.g. add a char at the end of
> the term to describe the part of speech) and inside the Payload for
> position-specific info (e.g. a relation id, paragraph id or whatever you
> want: it's a byte[]).
>
> With those techniques you can do many things. You have to be inventive,
> but with payloads you can do very interesting things.
> You can also store the offsets inside the payload and not bother with
> the term vector!
> Well, there are really hundreds of solutions for dealing with linguistic
> data inside Lucene. What is hard is when you have to deal with relations,
> but a triple store should be better suited for this.
>
> I also suggest storing a serialized form of your internal
> representation in the index; it may be more flexible to use than
> TermPositionVector.
>
> Hope it helps.
>
> On Fri, Oct 09, 2009 at 06:11:33PM +0200, Till Kolter wrote:
>> I am quite new to Lucene, but I have searched the FAQs and consulted
>> the mailing list archive. I debugged through the source code as well.
>>
>> I have written an Analyzer that analyzes a stream by sending it through
>> a whole pipeline of linguistic processing and uses the internal
>> representation to construct a TokenStream that tokenizes chunks
>> (semantic units). The TermAttribute string holds the abstract
>> representations of those units. For further uses (for instance,
>> highlighting the results in text), I need access to the
>> OffsetAttribute that I defined in my TokenStream implementation. As
>> in StandardTokenizer, I defined an OffsetAttribute to save the left and
>> right values of the original chunks.
>>
>> Now I want to search for all documents containing an
>> "AdjectivePhrase", get those APs from the documents and highlight all
>> APs in the found documents.
>>
>> I tried to find results by getting TermPositions with
>> "Reader.termPositions(term)" and then iterating over the positions, but
>> the positions only represent the left offset.
>>
>> Is there another function to get structured results from term queries
>> over documents, where I can get the whole set of attributes that I
>> constructed in the TokenStream with addAttribute(Class)? I did not
>> find such a function, but I guess I don't know all retrieval methods of
>> Lucene yet. For my search I used the IndexSearcher.
>>
>> Thanks
>> Till Kolter
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
> --
> David Causse
> Spotter
> http://www.spotter.com/
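
For anyone reading this thread later: David's two suggestions (put the part of speech into the term text, and put position-specific data such as offsets into the payload's byte[]) can be sketched in plain Java as below. This is only a minimal sketch under stated assumptions. The class OffsetPayloadCodec, its method names, the '_' separator and the 8-byte layout are all hypothetical and not part of Lucene. Handing the resulting byte[] to the Payload class and setting it on the PayloadAttribute inside the TokenStream is deliberately not shown.

```java
import java.nio.ByteBuffer;

/**
 * Hypothetical helper (not a Lucene class): serializes chunk metadata so it
 * can travel in the term text and in a payload byte[].
 */
public final class OffsetPayloadCodec {

    private OffsetPayloadCodec() {}

    /** Packs start and end character offsets as two big-endian ints (8 bytes). */
    public static byte[] encode(int startOffset, int endOffset) {
        if (startOffset < 0 || endOffset < startOffset) {
            throw new IllegalArgumentException("invalid offsets");
        }
        return ByteBuffer.allocate(8).putInt(startOffset).putInt(endOffset).array();
    }

    /** Unpacks a payload written by encode() into {startOffset, endOffset}. */
    public static int[] decode(byte[] payload) {
        if (payload == null || payload.length != 8) {
            throw new IllegalArgumentException("expected an 8-byte payload");
        }
        ByteBuffer buf = ByteBuffer.wrap(payload);
        return new int[] { buf.getInt(), buf.getInt() };
    }

    /** Appends a part-of-speech marker to the term text; the '_' separator is an assumption. */
    public static String tagTerm(String term, char posTag) {
        return term + '_' + posTag;
    }

    public static void main(String[] args) {
        byte[] payload = encode(12, 27);   // offsets of one chunk in the source text
        int[] offsets = decode(payload);   // what the search side would read back
        System.out.println(tagTerm("AdjectivePhrase", 'A')
                + " [" + offsets[0] + ", " + offsets[1] + ")");
    }
}
```

With something like this, the highlighting step would read the payload at each matching position and decode the offsets, rather than loading a TermPositionVector per document. Whether that is actually faster depends on the index and would need to be measured.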