Hi, I create my index with TermVector.WITH_POSITIONS_OFFSETS and get the term offsets with the following code. The code collects two arrays: HFIDs (unique IDs stored with the documents) and Highlights (strings with offset info).
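A side note on the Highlights format first: each entry is a comma-separated list of start-end character offsets, for example "9-14" for "brown" in "the lazy brown fox" (offsets are zero-based and the end offset points one past the last character). Below is a minimal sketch of how such a string might be applied to the original text. It uses no Lucene classes; the class and method names (HighlightOffsets, parse, mark) and the square-bracket markers are made up for illustration, and it assumes the spans do not overlap.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Lucene: consumes "start-end,start-end" strings.
class HighlightOffsets {

    /** Parses e.g. "9-14,0-3" into {start, end} pairs; an empty string yields no pairs. */
    static List<int[]> parse(String highlights) {
        List<int[]> spans = new ArrayList<>();
        if (highlights.isEmpty()) {
            return spans;
        }
        for (String part : highlights.split(",")) {
            int dash = part.indexOf('-');
            int start = Integer.parseInt(part.substring(0, dash));
            int end = Integer.parseInt(part.substring(dash + 1));
            spans.add(new int[] {start, end});
        }
        return spans;
    }

    /** Wraps every span of the text in [ and ]. Assumes spans do not overlap. */
    static String mark(String text, String highlights) {
        List<int[]> spans = parse(highlights);
        // The pairs are grouped by term, not by position, so sort them by
        // descending start and insert from the back to keep offsets valid.
        spans.sort((a, b) -> Integer.compare(b[0], a[0]));
        StringBuilder sb = new StringBuilder(text);
        for (int[] span : spans) {
            sb.insert(span[1], ']');
            sb.insert(span[0], '[');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(mark("the lazy brown fox", "9-14"));
        // prints: the lazy [brown] fox
    }
}
```

The search code that builds these strings follows.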
Please note that the Lucene code below requires the patch from bug #36292 (http://issues.apache.org/bugzilla/show_bug.cgi?id=36292) to work with prefix queries.

    import java.util.HashSet;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.index.TermPositionVector;
    import org.apache.lucene.index.TermVectorOffsetInfo;
    import org.apache.lucene.queryParser.QueryParser;
    import org.apache.lucene.search.Hits;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;

    // analyzer, reader (an open IndexReader) and querystr are set up elsewhere
    QueryParser parser = new QueryParser("text", analyzer);
    parser.setDefaultOperator(QueryParser.AND_OPERATOR);
    Query query = parser.parse(querystr);
    IndexSearcher searcher = new IndexSearcher(reader);
    Hits hits = searcher.search(query);
    //System.out.println("query.getClass()=\"" + query.getClass().toString() + "\"");

    // Collect the terms used by the query
    HashSet QueryTerms = new HashSet();
    query.extractTerms(QueryTerms);

    int NumHits = hits.length();
    int[] HFIDs = new int[NumHits];
    String[] Highlights = new String[NumHits];

    for (int i = 0; i < NumHits; i++) {
        Document doc = hits.doc(i);
        HFIDs[i] = Integer.parseInt(doc.get("hfid"));

        // StringBuffer instead of comparing Strings with != ""
        StringBuffer HiliString = new StringBuffer();
        TermPositionVector tpv =
            (TermPositionVector) reader.getTermFreqVector(hits.id(i), "text");
        String[] DocTerms = tpv.getTerms();

        for (int t = 0; t < DocTerms.length; t++) {
            if (QueryTerms.contains(new Term("text", DocTerms[t]))) {
                // One start-end pair per occurrence of this term
                TermVectorOffsetInfo[] offsets = tpv.getOffsets(t);
                for (int tp = 0; tp < offsets.length; tp++) {
                    if (HiliString.length() > 0) {
                        HiliString.append(",");
                    }
                    HiliString.append(offsets[tp].getStartOffset())
                              .append("-")
                              .append(offsets[tp].getEndOffset());
                }
            }
        }
        Highlights[i] = HiliString.toString();
    }

--
Mikko Noromaa ([EMAIL PROTECTED]) - tel. +358 40 7348034
Noromaa Solutions - see http://www.nm-sol.com/

> -----Original Message-----
> From: Sean O'Connor [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, August 24, 2005 12:42 AM
> To: java-user@lucene.apache.org
> Subject: Example of Field.TermVector.WITH_POSITIONS_OFFSETS usage?
>
> Hello,
> I am trying to work through term positions and how to get them from
> a collection of hits. Does setting TermVector.WITH_POSITIONS_OFFSETS to
> true save the start/end position of the term in the source text file?
> (I _think_ it does).
>
> If so, where would I start for trying to make that information
> accessible in a "result set"?
> I believe it would be extending a query, a scorer, a hit, and/or a
> weight object. I will be wanting to process ALL hits, so I think will
> need to implement a hitcollector.
>
> As an example of what I want, if I were looking for the offset
> position of "brown" in a properly indexed field containing "the lazy
> brown fox", I would like to get:
>     start==10
>     end==15 (assuming my counting is right)
>
> Based on Paul Elschot's previous response to a similar question I
> had (which I am still working on), I _think_ I need to extend something
> like the ExactPhraseScorer. While debugging with my IDE (Eclipse) I can
> see that the weight object in the scorer contains a reference to the
> query. The query contains the fields:
>     Vector positions (just has ints of term positions in phrase?)
>     Vector terms (vector of Term, just field name and field contents?)
>
> The weight also seems to have an array of TermPositions, which have
> SegmentTermPositions. I thought this was what I wanted, but I don't see
> the proper start/end fields, or anything which seems to be on the right
> track.
>
> Can anyone point me in the right direction?
> Thanks,
>
> Sean
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]