To my knowledge, group-by in Lucene/Solr has not yet been solved in a
good general way.
Ideally, we would want a solution that does not need to fit into memory.
However, to do the grouping you need the value of the field for each
document. As you are finding, this is not cheap to get. Currently, the
efficient way to get it is to use a FieldCache. This, however, requires
that every distinct value can fit into memory.
Once you have efficient access to the values, you need to be able to
efficiently group the results, again without being bounded by memory
(which, with the FieldCache, we already are).
There are quite a few ways to do this. The simplest is to group until
you have used all the memory you want. For everything left, if a value
matches an existing group, increment that group's count; if it doesn't,
write it to an overflow file. Use the overflow file as the input to the
next run, and repeat until there is no overflow. You can improve on that
by partitioning the overflow file.
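Here is a rough sketch of that overflow approach in plain Java, just to
make the passes concrete. It is not Lucene or Solr code: the class name
OverflowGrouper, the maxGroups limit (standing in for "all the memory
you want"), and the sink callback are all made up for illustration, and
it assumes the field values are single-line strings.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.function.BiConsumer;

// Hypothetical helper, not part of Lucene/Solr.
final class OverflowGrouper {

    // Counts each distinct value while keeping at most maxGroups groups
    // in memory per pass. Finished groups are handed to the sink.
    static void group(Iterator<String> values, int maxGroups,
                      BiConsumer<String, Long> sink) throws IOException {
        if (maxGroups < 1) {
            throw new IllegalArgumentException("maxGroups must be >= 1");
        }
        Path overflow = onePass(values, maxGroups, sink);
        while (overflow != null) {
            Path next;
            try (BufferedReader in =
                     Files.newBufferedReader(overflow, StandardCharsets.UTF_8)) {
                // The overflow of the previous pass is the input of this one.
                next = onePass(in.lines().iterator(), maxGroups, sink);
            } finally {
                Files.delete(overflow);
            }
            overflow = next;
        }
    }

    // Runs one pass; returns the overflow file, or null if nothing spilled.
    private static Path onePass(Iterator<String> values, int maxGroups,
                                BiConsumer<String, Long> sink) throws IOException {
        Map<String, Long> counts = new HashMap<>();
        Path overflow = Files.createTempFile("groupby-overflow", ".txt");
        long spilled = 0;
        try (BufferedWriter out =
                 Files.newBufferedWriter(overflow, StandardCharsets.UTF_8)) {
            while (values.hasNext()) {
                String value = values.next();
                Long count = counts.get(value);
                if (count != null) {
                    counts.put(value, count + 1);   // matches a group: count it
                } else if (counts.size() < maxGroups) {
                    counts.put(value, 1L);          // still room for a new group
                } else {
                    out.write(value);               // no room: defer to next pass
                    out.newLine();
                    spilled++;
                }
            }
        }
        // Every occurrence of these values was seen in this pass, so the
        // counts are final.
        counts.forEach(sink);
        if (spilled == 0) {
            Files.delete(overflow);
            return null;
        }
        return overflow;
    }
}
```

Each pass finishes at least one group, so the overflow shrinks every
time and the loop terminates. Partitioning the overflow (say, by a hash
of the value into several files) would cut down how often the same
values get rewritten.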
And then there are a dozen other methods.
Solr has a patch in JIRA that uses a sorting method. First the results
are sorted on the group-by field, then scanned through for grouping -
all field values that are the same will be next to each other. Finally,
if you really wanted to sort on a different field, another sort is
applied. That's not ideal IMO, but it's a start.
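To illustrate the sort-then-scan idea (this is not the code from the
JIRA patch, which I am not reproducing here), a minimal in-memory sketch
might look like the following. SortScanGrouper and its methods are
hypothetical names, and the second sort is by group size purely as an
example of "a different field".

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of Lucene/Solr.
final class SortScanGrouper {

    // Sorts the values, then scans once: equal values are now adjacent,
    // so each run of identical values is one group.
    static Map<String, Integer> groupCounts(List<String> fieldValues) {
        List<String> sorted = new ArrayList<>(fieldValues);
        Collections.sort(sorted);
        Map<String, Integer> groups = new LinkedHashMap<>();
        String current = null;
        int run = 0;
        for (String value : sorted) {
            if (run > 0 && !value.equals(current)) {
                groups.put(current, run);   // the run ended: emit the group
                run = 0;
            }
            current = value;
            run++;
        }
        if (run > 0) {
            groups.put(current, run);       // emit the final run
        }
        return groups;
    }

    // The extra sort: reorder the finished groups by something other
    // than the group-by value (here, descending group size).
    static List<Map.Entry<String, Integer>> byCountDescending(
            Map<String, Integer> groups) {
        List<Map.Entry<String, Integer>> entries =
            new ArrayList<>(groups.entrySet());
        entries.sort(Map.Entry.<String, Integer>comparingByValue().reversed());
        return entries;
    }
}
```

The scan only ever needs the current run in memory, which is the appeal
of this method; the cost is the two sorts.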
- Mark