Enable term vectors while indexing and use the TermVector API.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
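A minimal sketch of that suggestion (not from the thread itself): it assumes Lucene 3.x, a throwaway RAMDirectory, and an invented field name "contents". Term vectors are enabled per field at index time; `IndexReader.getTermFreqVector(docID, field)` then returns the analyzed terms and their in-document frequencies directly by docID, without scanning the index.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermVectorSketch {
  public static void main(String[] args) throws IOException {
    RAMDirectory dir = new RAMDirectory();
    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(dir, cfg);

    Document doc = new Document();
    // Field.TermVector.YES stores the per-document term vector alongside
    // the inverted index, so it can be fetched later by docID.
    doc.add(new Field("contents", "lucene stores term vectors per document",
        Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    // One lookup per (docID, field) -- no full-index scan involved.
    TermFreqVector tfv = reader.getTermFreqVector(0, "contents");
    String[] terms = tfv.getTerms();        // sorted, already-analyzed terms
    int[] freqs = tfv.getTermFrequencies(); // parallel frequency array
    for (int i = 0; i < terms.length; i++) {
      System.out.println(terms[i] + ":" + freqs[i]);
    }
    reader.close();
    dir.close();
  }
}
```

Because the terms come back post-analysis, the tokenization and stopword removal done at indexing time are reused, which is exactly what the question below asks for.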
> -----Original Message-----
> From: Giovanni Gherdovich [mailto:g.gherdov...@gmail.com]
> Sent: Sunday, July 15, 2012 5:57 PM
> To: java-user@lucene.apache.org
> Subject: from docID to terms enumerator in O(1) ?
>
> Hi all,
>
> I'd like to know if I can get the list of indexed terms in a document from
> its document ID in constant time (say, in a time independent of the size of
> the index).
>
> The reason why I ask might be relevant
> (you could suggest a totally different way to achieve my goal).
>
> I want to present the search results of a query as a word cloud, i.e. no
> scoring, no sorting, nothing at all: just a visual representation of the
> array of pairs (term, docFreq) for all terms appearing in at least one of
> the docs that matched my query.
>
> Skimming through the pages of "Lucene in Action"
> I found that I might need to call the method
>
>     void IndexSearcher.search(Query query, Collector results)
>
> i.e. pass that method my own Collector class, which fetches and cooks
> results the way I want.
>
> The author provides a very clear code example for the Collector:
>
> -- -- >8 -- -- -- -- >8 -- -- -- -- >8 -- --
> public class BookLinkCollector extends Collector {
>   private Map<String,String> documents = new HashMap<String,String>();
>   private Scorer scorer;
>   private String[] urls;
>   private String[] titles;
>
>   public boolean acceptsDocsOutOfOrder() {
>     return true;
>   }
>
>   public void setScorer(Scorer scorer) {
>     this.scorer = scorer;
>   }
>
>   public void setNextReader(IndexReader reader, int docBase)
>       throws IOException {
>     urls = FieldCache.DEFAULT.getStrings(reader, "url");
>     titles = FieldCache.DEFAULT.getStrings(reader, "title2");
>   }
>
>   public void collect(int docID) {
>     try {
>       String url = urls[docID];
>       String title = titles[docID];
>       documents.put(url, title);
>       System.out.println(title + ":" + scorer.score());
>     } catch (IOException e) {
>     }
>   }
>
>   public Map<String,String> getLinks() {
>     return Collections.unmodifiableMap(documents);
>   }
> }
> -- -- >8 -- -- -- -- >8 -- -- -- -- >8 -- --
>
> which is then used like:
>
> -- -- >8 -- -- -- -- >8 -- -- -- -- >8 -- --
> public void testCollecting() throws Exception {
>   Directory dir = TestUtil.getBookIndexDirectory();
>   TermQuery query = new TermQuery(new Term("contents", "junit"));
>   IndexSearcher searcher = new IndexSearcher(dir);
>   BookLinkCollector collector = new BookLinkCollector(searcher);
>
>   searcher.search(query, collector);
>   Map<String,String> linkMap = collector.getLinks();
>   assertEquals("ant in action",
>                linkMap.get("http://www.manning.com/loughran"));
>   searcher.close();
>   dir.close();
> }
> -- -- >8 -- -- -- -- >8 -- -- -- -- >8 -- --
>
> What might not work for me is the use of FieldCache on the IndexReader to
> retrieve all field values on the current segment; those values are returned
> as String[], while for me it would be more convenient to get a term
> enumerator: all the tokenizing and stopword removal work has already been
> done at indexing time, and I would like to leverage that.
>
> How does it sound?
>
> Cheers,
> Giovanni
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org