[
https://issues.apache.org/jira/browse/LUCENE-5579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
David Smiley updated LUCENE-5579:
---------------------------------
Attachment: LUCENE-5579_SPT_leaf_covered.patch
This patch is the first phase: it differentiates "covered" leaf cells (cells
that were within the indexed shape) from the other leaf cells, which are
approximated. The next phase will augment the Intersects filter, and possibly
others, to collect exact hits in conjunction with the fuzzy hits.
During the benchmarking I learned some interesting things:
* Quad tree is 50% of the size of Geohash! This observation is for non-point
data, since that's what's relevant to all this hit-confirmation business / leaf
cells. For point data, it'd be the other way around.
* Leaf pruning shaves 45%! So much for my plans to phase that out -- it's key.
* Differentiating leaf types (Covered vs Approximated) adds 4%.
* A more restrained leaf pruning that doesn't prune covered leaves larger than
those at the target/detail level yields 36% shaving (not as good as 45% --
expected). That is... we're adding these covered leaf bytes to subsequently
make exact results checking better so we don't want to be too liberal in
removing them. There's a trade-off here.
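To make the trade-off between the two pruning variants concrete, here is a
minimal sketch. It is not the code in the patch and does not use any Lucene
API: cells are modelled as quad tree token strings, `prune_leaves` and
`keep_large_covered` are hypothetical names, and it assumes that pruning means
replacing a complete set of sibling leaves with their parent, recursively.

```python
from collections import defaultdict

CHILDREN_PER_CELL = 4  # quad tree assumption: every cell has four children


def prune_leaves(leaves, detail_level, keep_large_covered=False):
    """Collapse complete sets of sibling leaves into their parent cell.

    leaves: dict mapping a cell token (e.g. "ABD", level == len(token))
            to True for a covered leaf or False for an approximated leaf.
    keep_large_covered: hypothetical flag; when set, a sibling set above the
            detail level is left alone if it contains a covered leaf.
    """
    leaves = dict(leaves)
    changed = True
    while changed:
        changed = False
        by_parent = defaultdict(list)
        for token in leaves:
            if len(token) > 1:  # never collapse into the world cell
                by_parent[token[:-1]].append(token)
        for parent, kids in by_parent.items():
            if len(kids) < CHILDREN_PER_CELL:
                continue  # incomplete sibling set, nothing to prune
            level = len(parent) + 1
            has_covered = any(leaves[k] for k in kids)
            if keep_large_covered and has_covered and level < detail_level:
                continue  # restrained mode: keep the larger covered leaves
            # The parent only counts as covered if every child was covered.
            parent_covered = all(leaves[k] for k in kids)
            for k in kids:
                del leaves[k]
            leaves[parent] = parent_covered
            changed = True
    return leaves


# One covered leaf at level 2, a full sibling set at detail level 3, and two
# approximated leaves at level 2.
example = {
    "AA": True,
    "ABA": True, "ABB": False, "ABC": False, "ABD": False,
    "AC": False, "AD": False,
}
liberal = prune_leaves(example, detail_level=3)
restrained = prune_leaves(example, detail_level=3, keep_large_covered=True)
```

In this toy input the liberal variant collapses everything into one
approximated cell, while the restrained variant keeps the covered level-2 leaf
at the cost of three extra terms.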
The attached patch includes some refactoring to share common logic between
Contains & AVPTF (the base of Within, Intersects, and heatmap). I need to add
a configurable flag to indicate if leaves should be differentiated in the first
place, since you might not want that, and another flag to adjust how much
pruning of the covered leaves happens. Both flags should be safe to change
without any re-indexing; they could be changed at any time. Obviously if you don't
have the covered leaf differentiation then you won't get the full benefit later
when we have exact match collection, just partial.
> Spatial, enhance RPT to differentiate confirmed from non-confirmed hits, then
> validate with SDV
> -----------------------------------------------------------------------------------------------
>
> Key: LUCENE-5579
> URL: https://issues.apache.org/jira/browse/LUCENE-5579
> Project: Lucene - Core
> Issue Type: New Feature
> Components: modules/spatial
> Reporter: David Smiley
> Attachments: LUCENE-5579_SPT_leaf_covered.patch
>
>
> If a cell is within the query shape (doesn't straddle the edge), then you can
> be sure that all documents it matches are confirmed hits. But if some
> documents are only on the edge cells, then those documents could be validated
> against SerializedDVStrategy for precise spatial search. This should be
> *much* faster than using RPT and SerializedDVStrategy independently on the
> same search, particularly when a lot of documents match.
> Perhaps this'll be a new RPT subclass, or maybe an optional configuration of
> RPT. This issue is just for the Intersects predicate, which will apply to
> Disjoint. Until resolved in other issues, the other predicates can be
> handled in a naive/slow way by creating a filter that combines RPT's filter
> and SerializedDVStrategy's filter using BitsFilteredDocIdSet.
> One thing I'm not sure of is how to expose to Lucene-spatial users the
> underlying functionality such that they can put other query/filters
> in-between RPT and the SerializedDVStrategy. Maybe that'll be done by simply
> ensuring the predicate filters have this capability and are public.
> It would be ideal to implement this capability _after_ the PrefixTree term
> encoding is modified to differentiate edge leaf-cells from non-edge leaf
> cells. This distinction will allow the code here to make more confirmed
> matches.
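The confirmed / non-confirmed split described in the issue above can be
sketched as follows. This is an illustration under simplifying assumptions,
not Lucene code: shapes and cells are axis-aligned rectangles, `postings` is a
hypothetical flat list standing in for the prefix tree index, and the
`exact_shapes` lookup stands in for the precise check that
SerializedDVStrategy would perform.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(a, b):
    return (a.min_x <= b.max_x and b.min_x <= a.max_x and
            a.min_y <= b.max_y and b.min_y <= a.max_y)


def contains(outer, inner):
    return (outer.min_x <= inner.min_x and inner.max_x <= outer.max_x and
            outer.min_y <= inner.min_y and inner.max_y <= outer.max_y)


def intersects_search(query, postings, exact_shapes):
    """Return (hits, number_of_exact_checks) for an Intersects query.

    postings: iterable of (cell, doc_id, covered), where covered is True
              when the indexed shape fully covers that leaf cell.
    exact_shapes: doc_id -> exact shape, consulted only for unconfirmed docs.
    """
    confirmed, unconfirmed = set(), set()
    for cell, doc_id, covered in postings:
        if not intersects(cell, query):
            continue
        # A cell inside the query, or a covered leaf touching the query,
        # guarantees that the document's shape intersects the query.
        if covered or contains(query, cell):
            confirmed.add(doc_id)
        else:
            unconfirmed.add(doc_id)
    unconfirmed -= confirmed  # no need to re-check already confirmed docs
    verified = {d for d in unconfirmed if intersects(exact_shapes[d], query)}
    return confirmed | verified, len(unconfirmed)


query = Rect(0, 0, 2.5, 2.5)
exact_shapes = {
    1: Rect(0.2, 0.2, 0.8, 0.8),
    2: Rect(2.6, 0.1, 2.9, 0.9),
    3: Rect(2.1, 1.1, 2.4, 1.9),
    4: Rect(1.9, 1.9, 4.1, 4.1),
    5: Rect(5.2, 5.2, 5.8, 5.8),
}
postings = [
    (Rect(0, 0, 1, 1), 1, False),  # cell within the query: confirmed
    (Rect(2, 0, 3, 1), 2, False),  # edge cell: needs the exact check
    (Rect(2, 1, 3, 2), 3, False),  # edge cell: needs the exact check
    (Rect(2, 2, 3, 3), 4, True),   # covered leaf on the edge: confirmed
    (Rect(5, 5, 6, 6), 5, False),  # cell does not touch the query
]
hits, exact_checks = intersects_search(query, postings, exact_shapes)
```

Only documents 2 and 3 reach the exact check here. Document 4 sits on an edge
cell too, but its covered leaf lets it be confirmed from the index alone,
which is the benefit the leaf differentiation is meant to provide.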