Enable term vectors while indexing and use the TermVector API.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: u...@thetaphi.de
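A minimal sketch of that suggestion (not from the thread itself): it assumes Lucene 3.x, a throwaway RAMDirectory, and an invented field name "contents". Term vectors are enabled per field at index time; `IndexReader.getTermFreqVector(docID, field)` then returns the analyzed terms and their in-document frequencies directly by docID, without scanning the index.

```java
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

public class TermVectorSketch {
  public static void main(String[] args) throws IOException {
    RAMDirectory dir = new RAMDirectory();
    IndexWriterConfig cfg = new IndexWriterConfig(
        Version.LUCENE_36, new StandardAnalyzer(Version.LUCENE_36));
    IndexWriter writer = new IndexWriter(dir, cfg);

    Document doc = new Document();
    // Field.TermVector.YES stores the per-document term vector alongside
    // the inverted index, so it can be fetched later by docID.
    doc.add(new Field("contents", "lucene stores term vectors per document",
        Field.Store.NO, Field.Index.ANALYZED, Field.TermVector.YES));
    writer.addDocument(doc);
    writer.close();

    IndexReader reader = IndexReader.open(dir);
    // One lookup per (docID, field) -- no full-index scan involved.
    TermFreqVector tfv = reader.getTermFreqVector(0, "contents");
    String[] terms = tfv.getTerms();        // sorted, already-analyzed terms
    int[] freqs = tfv.getTermFrequencies(); // parallel frequency array
    for (int i = 0; i < terms.length; i++) {
      System.out.println(terms[i] + ":" + freqs[i]);
    }
    reader.close();
    dir.close();
  }
}
```

Because the terms come back post-analysis, the tokenization and stopword removal done at indexing time are reused, which is exactly what the question below asks for.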
> -----Original Message-----
> From: Giovanni Gherdovich [mailto:g.gherdov...@gmail.com]
> Sent: Sunday, July 15, 2012 5:57 PM
> To: java-user@lucene.apache.org
> Subject: from docID to terms enumerator in O(1) ?
>
> Hi all,
>
> I'd like to know if I can get the list of indexed terms in a document from
> its document ID in constant time (say, in a time independent of the size of
> the index).
>
> The reason why I ask might be relevant
> (you could suggest a totally different way to achieve my goal).
>
> I want to present the search results of a query as a word cloud, i.e. no
> scoring, no sorting, nothing at all: just a visual representation of the
> array of pairs (term, docFreq) for all terms appearing in at least one of
> the docs that matched my query.
>
> Skimming through the pages of "Lucene in Action"
> I found that I might need to call the method
>
>     void IndexSearcher.search(Query query, Collector results)
>
> i.e. pass that method my own Collector class, which fetches and cooks
> results the way I want.
>
> The author provides a very clear code example for the Collector:
>
> -- -- >8 -- -- -- -- >8 -- -- -- -- >8 -- --
> public class BookLinkCollector extends Collector {
>   private Map<String,String> documents = new HashMap<String,String>();
>   private Scorer scorer;
>   private String[] urls;
>   private String[] titles;
>
>   public boolean acceptsDocsOutOfOrder() {
>     return true;
>   }
>
>   public void setScorer(Scorer scorer) {
>     this.scorer = scorer;
>   }
>
>   public void setNextReader(IndexReader reader, int docBase)
>       throws IOException {
>     urls = FieldCache.DEFAULT.getStrings(reader, "url");
>     titles = FieldCache.DEFAULT.getStrings(reader, "title2");
>   }
>
>   public void collect(int docID) {
>     try {
>       String url = urls[docID];
>       String title = titles[docID];
>       documents.put(url, title);
>       System.out.println(title + ":" + scorer.score());
>     } catch (IOException e) {
>     }
>   }
>
>   public Map<String,String> getLinks() {
>     return Collections.unmodifiableMap(documents);
>   }
> }
> -- -- >8 -- -- -- -- >8 -- -- -- -- >8 -- --
>
> which is then used like:
>
> -- -- >8 -- -- -- -- >8 -- -- -- -- >8 -- --
> public void testCollecting() throws Exception {
>   Directory dir = TestUtil.getBookIndexDirectory();
>   TermQuery query = new TermQuery(new Term("contents", "junit"));
>   IndexSearcher searcher = new IndexSearcher(dir);
>   BookLinkCollector collector = new BookLinkCollector(searcher);
>
>   searcher.search(query, collector);
>   Map<String,String> linkMap = collector.getLinks();
>   assertEquals("ant in action",
>                linkMap.get("http://www.manning.com/loughran"));
>   searcher.close();
>   dir.close();
> }
> -- -- >8 -- -- -- -- >8 -- -- -- -- >8 -- --
>
> What might not work for me is the use of FieldCache on the IndexReader to
> retrieve all field values on the current segment; those values are returned
> as String[], while for me it would be more convenient to get a term
> enumerator: all the tokenizing and stopword removal work has already been
> done at indexing time, and I would like to leverage that.
>
> How does it sound?
>
> Cheers,
> Giovanni
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org