Collapsing on 30 million distinct values is going to cause memory problems
for sure. If the heap is growing as the result set grows, you are likely
on a newer version of Solr, which collapses into a hashmap, so memory
scales with the result set. Older versions of Solr collapsed into an array
30 million entries long, which would probably have blown up memory even
with small result sets.
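
For reference, this is the kind of collapse in question, expressed as a
filter query (the groupId field name here is a placeholder for your own
collapse field):

    q=*:*&fq={!collapse field=groupId}

Each group of documents sharing a groupId value is reduced to a single
head document, which is why the collapse structure has to track the
distinct values it encounters.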

I think you're going to need to shard to get this to perform well. With
SolrCloud you can shard on the collapse key (see
https://solr.apache.org/guide/8_7/shards-and-indexing-data-in-solrcloud.html#document-routing).
This sends all documents with the same collapse key to the same shard.
Collapse is evaluated independently on each shard, so co-locating each
group on a single shard is what keeps the collapsed results correct. Then
run the collapse query on the sharded collection, as sketched below.
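
As a rough sketch (collection name, config name, and shard count are
placeholders, not recommendations), the collection would be created with
router.field pointing at the collapse key:

    http://localhost:8983/solr/admin/collections?action=CREATE&name=collapsed&collection.configName=myconfig&numShards=8&router.field=groupId

and the collapse query then runs unchanged against it:

    http://localhost:8983/solr/collapsed/select?q=*:*&fq={!collapse field=groupId}

One caveat with router.field: every document you index must have a value
in that field, since it is what the router hashes to pick a shard.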

Joel Bernstein
http://joelsolr.blogspot.com/


On Wed, Mar 23, 2022 at 9:42 PM Jeremy Buckley - IQ-C
<jeremy.buck...@gsa.gov.invalid> wrote:

> The number of documents in the collection is about 90 million. The
> collapse field has about 30 million distinct values, so I guess that
> qualifies as high cardinality.  We used to use result grouping but switched
> to collapse for improved performance.
>
> The faceting fields are more of a mix, 5-10 fields ranging from around a
> dozen to around 250,000 distinct values.
>
> On Wed, Mar 23, 2022 at 8:30 PM Joel Bernstein <joels...@gmail.com> wrote:
>
> > It sounds like you are collapsing on a high cardinality field and/or
> > faceting on high cardinality fields. Can you describe the cardinality of
> > the fields so we can get an idea of how large the problem is?
> >
> > Joel Bernstein
> > http://joelsolr.blogspot.com/
> >
>
