Ignacio Vera created LUCENE-8888:
------------------------------------
Summary: Improve distribution of points with data dimension in BKD
tree leaves
Key: LUCENE-8888
URL: https://issues.apache.org/jira/browse/LUCENE-8888
Project: Lucene - Core
Issue Type: Improvement
Reporter: Ignacio Vera
LUCENE-8688 introduced a new storage strategy for leaves that contain
duplicate points. This works well with indexed dimensions, because the
process of partitioning the space and the final sorting of the leaves group
points with equal indexed dimensions together.
This is not always the case if the points contain data dimensions. If two
points have the same indexed dimensions but different data dimensions, their
distribution in the leaves might not be optimal.
A good example is a user indexing a bounding box using LatLonShape. The
resulting tessellation of a bounding box is two triangles with the same
indexed dimensions but different data dimensions. If two documents index the
same bounding box, the result in the leaf is the triangles from one document
followed by the triangles of the second document. This is because the current
sorting/selection algorithms use one indexed dimension and tie-break on the
docID.
The optimal distribution in the case above is to group the equal triangles
together. Therefore, what is proposed here is to update the
selection/sorting algorithms to use the data dimensions, when they exist, as
tie-breakers before using the docID.
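
To illustrate the difference between the two orderings, here is a minimal
sketch in plain Java. It does not use Lucene's BKD code: the Point class, the
byte layouts, and the method names are hypothetical, and the indexed
dimensions are simplified to a single packed key. Two documents index the
same two triangles, A and B, which share their indexed bytes and differ only
in their data bytes.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class TieBreakSketch {
    /** Hypothetical stand-in for a point stored in a leaf; not a Lucene class. */
    static final class Point {
        final byte[] indexed; // packed indexed dimensions (simplified to one key)
        final byte[] data;    // packed data dimensions
        final int docID;

        Point(byte[] indexed, byte[] data, int docID) {
            this.indexed = indexed;
            this.data = data;
            this.docID = docID;
        }
    }

    /** Lexicographic comparison of two byte arrays, treating bytes as unsigned. */
    static int compareUnsignedBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) {
                return c;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Ordering as described for the current code: indexed key, then docID. */
    static int compareCurrent(Point a, Point b) {
        int c = compareUnsignedBytes(a.indexed, b.indexed);
        return c != 0 ? c : Integer.compare(a.docID, b.docID);
    }

    /** Proposed ordering: indexed key, then data dimensions, then docID. */
    static int compareProposed(Point a, Point b) {
        int c = compareUnsignedBytes(a.indexed, b.indexed);
        if (c != 0) {
            return c;
        }
        c = compareUnsignedBytes(a.data, b.data);
        return c != 0 ? c : Integer.compare(a.docID, b.docID);
    }

    /** Two documents, each contributing triangles A (data = 1) and B (data = 2). */
    static List<Point> sample() {
        byte[] box = {0, 0, 0, 1}; // identical indexed bytes for every triangle
        List<Point> points = new ArrayList<>();
        for (int doc = 0; doc < 2; doc++) {
            points.add(new Point(box, new byte[] {1}, doc)); // triangle A
            points.add(new Point(box, new byte[] {2}, doc)); // triangle B
        }
        return points;
    }

    /** Sorts a copy of the points and renders them as e.g. "A0 B0 A1 B1". */
    static String order(List<Point> points, Comparator<Point> comparator) {
        List<Point> sorted = new ArrayList<>(points);
        sorted.sort(comparator); // List.sort is stable, so full ties keep input order
        StringBuilder sb = new StringBuilder();
        for (Point p : sorted) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(p.data[0] == 1 ? 'A' : 'B').append(p.docID);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Point> points = sample();
        System.out.println(order(points, TieBreakSketch::compareCurrent));
        System.out.println(order(points, TieBreakSketch::compareProposed));
    }
}
```

With the docID-only tie-break the triangles stay interleaved by document (A0
B0 A1 B1), while the data-dimension tie-break places equal triangles next to
each other (A0 A1 B0 B1), which is the grouping that the duplicate-point leaf
strategy can benefit from.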