Hello,

In a Lucene index I have documents containing a number of single-token text fields indexed as StringField. For each field I would like to query the most frequent terms among its single-token values - a top 10 by occurrence count - and be able to run that query efficiently on a subset of documents.
I have tried several approaches with Lucene 6.4.2:

- Using HighFreqTerms, e.g.: HighFreqTerms.getHighFreqTerms(searcher.getIndexReader(), TOP_N_COUNT, fieldName, new HighFreqTerms.DocFreqComparator()); This is quite convenient, but it cannot be restricted to a subset of documents.

- Using GroupingSearch. This way I can filter documents in the index, but the resulting groups cannot be sorted by the number of occurrences. As a workaround I could request a large number of result groups and then sort them by hit count. That is far from ideal: in the worst case the field value is unique for each document, leading to high memory consumption.

- Using the Facets API: adding a FacetField for each document field and using FastTaxonomyFacetCounts to query the top N values. With this approach I can both filter the documents and get the most frequent single-token values without loading all groups into memory. The main disadvantage is degraded indexing performance - in my case indexing takes twice as long when a FacetField is added alongside each StringField.

Is there any other or better way to get the most frequent single-token field values with the ability to filter documents in a Lucene index?

Thanks.
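For reference, here is a minimal sketch of the third approach (Facets API) as I understand it. The field names "color" and "region", the sample values, and the filter query are all made up for illustration; it assumes Lucene 6.x with the lucene-facet module on the classpath:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.facet.FacetField;
import org.apache.lucene.facet.FacetResult;
import org.apache.lucene.facet.Facets;
import org.apache.lucene.facet.FacetsCollector;
import org.apache.lucene.facet.FacetsConfig;
import org.apache.lucene.facet.LabelAndValue;
import org.apache.lucene.facet.taxonomy.FastTaxonomyFacetCounts;
import org.apache.lucene.facet.taxonomy.TaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyReader;
import org.apache.lucene.facet.taxonomy.directory.DirectoryTaxonomyWriter;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public class TopTermsFacetExample {
    public static void main(String[] args) throws Exception {
        Directory indexDir = new RAMDirectory();
        Directory taxoDir = new RAMDirectory();
        FacetsConfig config = new FacetsConfig();

        try (IndexWriter writer = new IndexWriter(indexDir,
                 new IndexWriterConfig(new StandardAnalyzer()));
             DirectoryTaxonomyWriter taxoWriter = new DirectoryTaxonomyWriter(taxoDir)) {
            // Each single-token value is indexed twice: as a StringField
            // (for filtering) and as a FacetField (for counting) -
            // this duplication is what slows indexing down.
            String[][] docs = { {"red", "eu"}, {"red", "eu"}, {"blue", "eu"}, {"red", "us"} };
            for (String[] d : docs) {
                Document doc = new Document();
                doc.add(new StringField("color", d[0], Field.Store.NO));
                doc.add(new FacetField("color", d[0]));
                doc.add(new StringField("region", d[1], Field.Store.NO));
                writer.addDocument(config.build(taxoWriter, doc));
            }
        }

        try (DirectoryReader reader = DirectoryReader.open(indexDir);
             TaxonomyReader taxoReader = new DirectoryTaxonomyReader(taxoDir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            // Facet counts are taken only over the filtered subset (region:eu).
            FacetsCollector fc = new FacetsCollector();
            FacetsCollector.search(searcher, new TermQuery(new Term("region", "eu")), 10, fc);
            Facets facets = new FastTaxonomyFacetCounts(taxoReader, config, fc);
            FacetResult top = facets.getTopChildren(10, "color");
            for (LabelAndValue lv : top.labelValues) {
                System.out.println(lv.label + "=" + lv.value);
            }
        }
    }
}
```

With the sample data above, the region:eu filter matches three documents, so the top children of "color" come back as red=2, blue=1 - i.e. the top-N counts reflect only the filtered subset, without materializing a group per distinct value.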