To my knowledge, group-by in Lucene/Solr has not yet been solved in a good, general way.

Ideally, we would want a solution that does not need to fit into memory. However, you need the value of the field for each document to do the grouping. As you are finding, this is not cheap to get. Currently, the efficient way to get it is to use a FieldCache. This, however, requires that every distinct value can fit into memory.

Once you have efficient access to the values, you need to be able to group the results efficiently, again ideally without being bounded by memory (which, with the FieldCache, we already are).

There are quite a few ways to do this. The simplest is to create groups until you have used all the memory you want. Then, for every remaining value: if it matches an existing group, increment that group's count; if it doesn't, write it to an overflow file. Use the overflow file as the input for the next run, and repeat until there is no overflow. You can improve on that by partitioning the overflow file.
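
To make the overflow idea concrete, here is a rough sketch in plain Java. It is not Lucene or Solr code: the class name OverflowGroupBy, the maxGroups limit (standing in for "all the memory you want") and the emit callback are all made up for illustration, and it assumes the field values contain no line breaks so they can be spilled one per line.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical sketch: count group sizes while holding at most maxGroups
// groups in memory per pass, spilling everything else to an overflow file.
class OverflowGroupBy {

    static void groupCounts(Iterable<String> values, int maxGroups,
                            BiConsumer<String, Long> emit) throws IOException {
        if (maxGroups < 1) {
            throw new IllegalArgumentException("maxGroups must be at least 1");
        }
        Path input = null; // null means "first pass, read from values"
        while (true) {
            Map<String, Long> groups = new HashMap<>();
            Path overflow = Files.createTempFile("groupby-overflow", ".txt");
            long spilled = 0;
            try (BufferedWriter out = Files.newBufferedWriter(overflow, StandardCharsets.UTF_8)) {
                if (input == null) {
                    for (String v : values) {
                        spilled += accept(v, groups, maxGroups, out);
                    }
                } else {
                    try (BufferedReader in = Files.newBufferedReader(input, StandardCharsets.UTF_8)) {
                        String v;
                        while ((v = in.readLine()) != null) {
                            spilled += accept(v, groups, maxGroups, out);
                        }
                    }
                }
            }
            // Every group held in this pass is complete: all of its values were
            // counted here and none of them went to the overflow file.
            groups.forEach(emit);
            if (input != null) {
                Files.delete(input);
            }
            if (spilled == 0) {
                Files.delete(overflow);
                return;
            }
            input = overflow; // the overflow becomes the input of the next run
        }
    }

    // Returns 1 if the value was spilled to the overflow file, 0 if it was counted.
    private static int accept(String v, Map<String, Long> groups, int maxGroups,
                              BufferedWriter out) throws IOException {
        if (groups.containsKey(v) || groups.size() < maxGroups) {
            groups.merge(v, 1L, Long::sum);
            return 0;
        }
        out.write(v);
        out.newLine();
        return 1;
    }
}
```

Each pass finishes at least one group, so the overflow shrinks every run. Partitioning would replace the single overflow file with several, for example one per hash bucket of the value, so that later runs each read less data.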

And then there are a dozen other methods.

Solr has a patch in JIRA that uses a sorting method. First the results are sorted on the group-by field, then scanned through for grouping: all field values that are the same will be next to each other. Finally, if you really wanted to sort on a different field, another sort is applied. That's not ideal IMO, but it's a start.
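
For illustration only, the sort-then-scan approach looks roughly like this in plain Java. This is not the code from the JIRA patch; the Doc and Group classes are hypothetical, and I have assumed the "different field" for the final sort is the best score in each group.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of sort-then-scan grouping; not the Solr patch itself.
class SortThenScanGrouping {

    static final class Doc {
        final int id;
        final String groupValue;
        final float score;

        Doc(int id, String groupValue, float score) {
            this.id = id;
            this.groupValue = groupValue;
            this.score = score;
        }
    }

    static final class Group {
        final String value;
        final List<Doc> docs = new ArrayList<>();
        float maxScore = Float.NEGATIVE_INFINITY;

        Group(String value) {
            this.value = value;
        }
    }

    static List<Group> group(List<Doc> hits) {
        // 1. Sort on the group-by field so equal values become adjacent.
        List<Doc> sorted = new ArrayList<>(hits);
        sorted.sort(Comparator.comparing((Doc d) -> d.groupValue));

        // 2. Scan once; a change in value starts a new group.
        List<Group> groups = new ArrayList<>();
        Group current = null;
        for (Doc d : sorted) {
            if (current == null || !current.value.equals(d.groupValue)) {
                current = new Group(d.groupValue);
                groups.add(current);
            }
            current.docs.add(d);
            current.maxScore = Math.max(current.maxScore, d.score);
        }

        // 3. The extra sort: order the groups by a different key
        //    (assumed here: best score in the group, highest first).
        groups.sort((a, b) -> Float.compare(b.maxScore, a.maxScore));
        return groups;
    }
}
```

The cost is visible in steps 1 and 3: the whole result set is sorted once just to do the grouping, and then again if you want any order other than the group-by field.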

- Mark
