Re: solr getUniqueTermCount() when multiple segments?

Michael McCandless Tue, 07 Sep 2010 01:46:02 -0700

This is expected/intentional, because computing the "true" unique term
count across multiple segments is exceptionally costly (you have to do
the merge sort to de-dup).


If you really want the true count, you can pull the TermsEnum and
.next() until exhaustion.
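
To make the cost concrete, here is a rough plain-Java sketch of the merge-sort de-dup that an exhaustive walk over a multi-segment enum amounts to. It is not Lucene code: each "segment" is assumed to be a sorted, de-duplicated list of strings, and MergedTermCounter, Cursor and countUniqueTerms are hypothetical names used only for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

// Plain-Java stand-in, not Lucene API: each "segment" is a sorted,
// de-duplicated list of terms. All names here are hypothetical.
class MergedTermCounter {

    // Cursor over one segment's sorted term list.
    private static final class Cursor {
        final List<String> terms;
        int pos = 0;
        Cursor(List<String> terms) { this.terms = terms; }
        String current() { return terms.get(pos); }
        boolean advance() { return ++pos < terms.size(); }
    }

    // K-way merge: always take the smallest current term across segments,
    // and count it only when it differs from the previously emitted term.
    static long countUniqueTerms(List<List<String>> segments) {
        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((a, b) -> a.current().compareTo(b.current()));
        for (List<String> segment : segments) {
            if (!segment.isEmpty()) {
                queue.add(new Cursor(segment));
            }
        }
        long unique = 0;
        String last = null;
        while (!queue.isEmpty()) {
            Cursor top = queue.poll();
            String term = top.current();
            if (!term.equals(last)) {
                unique++;
                last = term;
            }
            // Re-insert the cursor only if its segment has more terms.
            if (top.advance()) {
                queue.add(top);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("apple", "banana", "cherry"),
            Arrays.asList("banana", "date"),
            Arrays.asList("apple", "date", "fig"));
        System.out.println(countUniqueTerms(segments)); // prints 5
    }
}
```

Every term of every segment passes through the queue once, so the work grows with the total number of terms in the index rather than with the number of segments, which is why this is not offered as a cheap call.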

Alternatively, you can use IndexReader.getSequentialSubReaders(), then
step through each segment reader, calling its .getUniqueTermCount(), and
then "approximate" from those per-segment counts (e.g., the sum is an
upper bound on the total unique count).
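
As a rough illustration of that approximation, assuming the per-segment counts have already been collected into an array, the bounds could be computed like this. This is plain Java, not Lucene API, and UniqueTermBounds and its methods are hypothetical names. The lower bound is my addition: the merged count can never be smaller than the largest single segment's count.

```java
// Plain-Java sketch, not Lucene API. Input is assumed to be the
// per-segment unique term counts for one field, gathered elsewhere.
class UniqueTermBounds {

    // Sum of per-segment counts: exact only if no term is shared
    // between segments, otherwise an overestimate.
    static long upperBound(long[] perSegmentCounts) {
        long sum = 0;
        for (long count : perSegmentCounts) {
            sum += count;
        }
        return sum;
    }

    // Largest per-segment count: the merged term set contains every
    // term of that segment, so it cannot be smaller.
    static long lowerBound(long[] perSegmentCounts) {
        long max = 0;
        for (long count : perSegmentCounts) {
            max = Math.max(max, count);
        }
        return max;
    }

    public static void main(String[] args) {
        long[] counts = {3, 2, 3};
        System.out.println(lowerBound(counts) + " <= true count <= " + upperBound(counts));
    }
}
```

For three segments holding 3, 2 and 3 unique terms, the true count lies somewhere between 3 and 8. How tight that range is depends entirely on how much the segments' term sets overlap.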

Mike

On Tue, Sep 7, 2010 at 2:34 AM, Ryan McKinley <[email protected]> wrote:
> Hello-
>
> I'm looking at using the new terms.getUniqueTermCount() to give a
> quick count for the LukeRequestHandler rather than needing to walk all
> the terms.
>
> When the Solr index reader has just one segment, it works great.
> However, with more segments I get:
>
> java.lang.UnsupportedOperationException: this reader does not
> implement getUniqueTermCount()
>        at org.apache.lucene.index.Terms.getUniqueTermCount(Terms.java:84)
>
> Is this expected?  Is there any way around that?
>
> I am getting the terms using:
>
>          Terms terms = MultiFields.getTerms(reader, fieldName);
>          long cnt = (terms==null) ? 0 : terms.getUniqueTermCount();
>
> Thanks
> ryan
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
>
>

