I have been thinking about our initial idea to use DisjunctionMaxQuery (aka DisMax) with MoreLikeThis instead of the Boolean query we have today.

## Definition and landscape

Given a set of subqueries combined as SHOULD clauses, DisMax scores a matching document with the score of its highest-scoring subquery instead of adding up the scores of all subqueries.

A concrete use case is as follows. If the query is "albino elephant", this ensures that "albino" matching one field and "elephant" matching another gets a higher score than "albino" matching both fields.

Each term ("albino" and "elephant") gets a DisMax query whose subqueries are one term query per targeted field. Then both DisMax queries are joined with a regular boolean query. In pseudo HSearch query DSL it would look like:

    .bool()
        .should(
            .dismax()
                .should( .keyword().onField("title").matching("Albinos") )
                .should( .keyword().onField("description").matching("Albinos") )
        )
        .should(
            .dismax()
                .should( .keyword().onField("title").matching("Elephant") )
                .should( .keyword().onField("description").matching("Elephant") )
        )

## More Like This (aka MLT)

Our more like this algorithm does the following:

- look up the term vectors of a document i
- for each field f contained in document i (or a subset of its fields)
  - find the most popular terms in field f of document i
  - build a boolean query with the most popular terms on field f
- combine these per-field boolean queries into a bigger boolean query

The original Lucene more like this algorithm is a bit different, in the sense that it does not look for popular terms *per field* but rather looks for the all-star popular terms of document i as a whole and then builds a boolean query with these most popular terms for each field.

## More Like This and DisMax

With our MLT approach, terms are not necessarily shared between fields. In fact a term is only looked for in field f if it belongs to field f of document i in the first place.

I don't see how DisMax would be of any use for us, as we don't have a common set of terms that we look for across several fields. At least not to solve the now famous albino elephant problem.

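
To make the scoring difference from the definition section concrete, here is a minimal plain-Java sketch with no Lucene dependency. The class name, method names and per-term scores are made up for illustration. The only thing taken from Lucene is the DisMax combination rule: the best subquery score plus the tie-breaker multiplier times the scores of the other subqueries.

```java
public class DisMaxSketch {

    // Plain boolean SHOULD combination: matching subquery scores add up.
    public static double sum(double... subScores) {
        double total = 0.0;
        for (double s : subScores) {
            total += s;
        }
        return total;
    }

    // DisMax combination: best score, plus tieBreaker times the remaining scores.
    // A subquery that does not match is represented here by a score of 0.
    public static double disMax(double tieBreaker, double... subScores) {
        double max = 0.0;
        double total = 0.0;
        for (double s : subScores) {
            total += s;
            max = Math.max(max, s);
        }
        return max + tieBreaker * (total - max);
    }

    public static void main(String[] args) {
        // Hypothetical scores, ordered as {title, description}.
        // Doc A has "albino" in both fields and no "elephant".
        double[] albinoA = {1.0, 1.0};
        double[] elephantA = {0.0, 0.0};
        // Doc B has "albino" in the title and "elephant" in the description.
        double[] albinoB = {1.0, 0.0};
        double[] elephantB = {0.0, 1.0};

        // Flat boolean query over the four term queries: A and B tie.
        System.out.println("bool   A=" + (sum(albinoA) + sum(elephantA))
                + " B=" + (sum(albinoB) + sum(elephantB)));

        // One DisMax per term (tie breaker 0), joined by a boolean query: B wins.
        System.out.println("dismax A=" + (disMax(0.0, albinoA) + disMax(0.0, elephantA))
                + " B=" + (disMax(0.0, albinoB) + disMax(0.0, elephantB)));
    }
}
```

With a tie breaker of 0 only the best field counts for each term; raising it towards 1 moves the result back towards the plain boolean sum.
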
We could use DisMax for the final top boolean query. The effect would be that documents are scored up to the highest lookalike-factor of their best field, as opposed to the cumulative lookalike-ness of all fields. Is that desirable? It does not look like it. I would naturally use boost factors between fields to express their respective importance but still want to find matching documents across all fields.

Thoughts?

## DisMax and our current keyword matching

It would make some sense, I think, to offer DisMax for our current keyword matching queries.

    .keyword().onFields("title", "description").matching("Albinos Elephant")

In this case, **and assuming the same analyzer for both fields**, we could use DisMax to essentially do

    .bool()
        .should(
            .dismax()
                .should( .keyword().onField("title").matching("Albinos") )
                .should( .keyword().onField("description").matching("Albinos") )
        )
        .should(
            .dismax()
                .should( .keyword().onField("title").matching("Elephant") )
                .should( .keyword().onField("description").matching("Elephant") )
        )

I am not sure what we would call that effect. Some candidates:

- .favorMultipleKeywordMatching()
- .decreaseCrossFieldKeywordImportanceBy(90%) //this number is 1 - DisMax tieBreakerMultiplier for the curious; 100% is what I have described above

## DisMax as top level DSL feature

Should we add .dismax() like we did bool()? I am hard pressed to find a use case.

Emmanuel